Abstract
In this paper, we propose a new parallel algorithm for text document clustering based on the concept of neighbors (Guha et al. in Inf Syst 25(5):345–366, 2000). If two documents are similar enough, they are considered neighbors of each other. The new algorithm, named parallel k-means based on neighbors (PKBN), is a parallel version of the sequential k-means based on neighbors (SKBN) algorithm that we proposed in Luo et al. (Data Knowl Eng 68(11):1271–1288, 2009). PKBN fully exploits the data parallelism of SKBN and adopts a new parallel pair-generating method to build the neighbor matrix. This pair-generating method incurs less communication overhead between processors than existing methods. PKBN is designed for message-passing multiprocessor systems, and we implemented it on a cluster of Linux workstations to analyze its performance. Our experimental results on real-life data sets demonstrate that PKBN is very efficient and scales well with respect to both the number of processors and the size of the data set.
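As a rough illustration of the neighbor concept, the following sequential Python sketch builds a neighbor matrix from document vectors. It is not the PKBN pair-generating method: the use of cosine similarity, the threshold name `theta`, the function names, and the choice to treat each document as its own neighbor are all assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def neighbor_matrix(docs, theta):
    """Return an n x n 0/1 matrix; entry (i, j) is 1 when documents
    i and j have similarity >= theta (hypothetical threshold)."""
    n = len(docs)
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        m[i][i] = 1  # assumption: a document is its own neighbor
        # Each unordered pair (i, j) is evaluated once; the matrix is symmetric.
        for j in range(i + 1, n):
            if cosine(docs[i], docs[j]) >= theta:
                m[i][j] = m[j][i] = 1
    return m

# Toy vectors: the first two documents share terms, the third does not.
docs = [[1.0, 1.0, 0.0], [1.0, 0.9, 0.0], [0.0, 0.0, 1.0]]
M = neighbor_matrix(docs, theta=0.8)
```

The nested loop visits every unordered document pair, which is the quadratic pair-generation work that a parallel version would need to divide among processors.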
References
Aboutabl, A.E., Elsayed, M.N.: A novel parallel algorithm for clustering documents based on the hierarchical agglomerative approach. Int. J. Comput. Sci. Inf. Technol. (IJCSIT) 3(2), 152–163 (2011)
Bobda, C., Steenbock, N.: Singular value decomposition on distributed reconfigurable systems. In: Proceedings of the 12th International Workshop on Rapid System Prototyping, pp. 38–43 (2001)
Brent, R.P., Luk, F.T.: The solution of singular-value and symmetric eigen-value problems on multiprocessor arrays. SIAM J. Sci. Stat. Comput. 6, 69–84 (1985)
Cao, Z., Zhou, Y.: Parallel text clustering based on MapReduce. In: Proceedings of the 2nd International Conference on Cloud and Green Computing, pp. 226–229 (2012)
Cutting, D.R., Karger, D.R., Pedersen, J.O., Tukey, J.W.: Scatter/gather: a cluster-based approach to browsing large document collections. In: Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 318–329 (1992)
Dhillon, I.S., Modha, D.S.: A data-clustering algorithm on distributed memory multiprocessors. In: Large-Scale Parallel Data Mining, LNCS, vol. 1759, pp. 245–260. Springer, Heidelberg (2000)
Forman, G., Zhang, B.: Distributed data clustering can be efficient and exact. SIGKDD Explor. Newsl. 2(2), 34–38 (2000)
Garey, M.R., Johnson, D.S., Witsenhausen, H.S.: Complexity of the generalized Lloyd–Max problem. IEEE Trans. Inf. Theory 28(2), 256–257 (1982)
Garg, A., Mangla, A., Gupta, N., Bhatnagar, V.: PBIRCH: a scalable parallel clustering algorithm for incremental data. In: Proceedings of the International Database Engineering and Applications Symposium, pp. 315–316 (2006)
Guha, S., Rastogi, R., Shim, K.: ROCK: a robust clustering algorithm for categorical attributes. Inf. Syst. 25(5), 345–366 (2000)
Hartigan, J.A.: Clustering Algorithms. Wiley, New York (1975)
Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs (1988)
Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York (1990)
Larsen, B., Aone, C.: Fast and effective text mining using linear-time document clustering. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 16–22 (1999)
Lewis, D., Yang, Y., Rose, T., Li, F.: RCV1: a new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, 361–397 (2004). Information about RCV1. http://www.daviddlewis.com/resources/testcollections/rcv1/ Accessed 20 Oct 2014
Li, X., Fang, Z.: Parallel clustering algorithms. Parallel Comput. 11(3), 275–290 (1989)
Li, Y., Chung, S.M.: Parallel bisecting K-means with prediction clustering algorithm. J. Supercomput. 39(1), 19–37 (2007)
Li, Y., Chung, S.M., Holt, J.D.: Text document clustering based on frequent word meaning sequences. Data Knowl. Eng. 64(1), 381–404 (2008)
Li, Y., Luo, C., Chung, S.M.: Text clustering with feature selection by using statistical data. IEEE Trans. Knowl. Data Eng. 20(5), 641–652 (2008)
Liu, G., Wang, Y., Zhao, T., Li, D.: Research on the parallel text clustering algorithm based on the semantic tree. In: Proceedings of the 6th International Conference on Computer Sciences and Convergence Information Technology, pp. 400–403 (2011)
Luo, C., Li, Y., Chung, S.M.: Text document clustering based on neighbors. Data Knowl. Eng. 68(11), 1271–1288 (2009)
Mogill, J.A., Haglin, D.J.: Toward parallel document clustering. In: Proceedings of the IEEE International Symposium on Parallel and Distributed Processing (IPDPS) Workshops and PhD Forum, pp. 1700–1709 (2011)
Moore, A.W.: The anchors hierarchy: using the triangle inequality to survive high dimensional data. In: Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pp. 397–405 (2000)
Olson, C.F.: Parallel algorithms for hierarchical clustering. Parallel Comput. 21, 1313–1325 (1995)
Ordonez, C., Omiecinski, E.: Efficient disk-based K-means clustering for relational databases. IEEE Trans. Knowl. Data Eng. 16(8), 909–921 (2004)
Ranka, S., Sahni, S.: Clustering on a hypercube multicomputer. IEEE Trans. Parallel Distrib. Syst. 2(2), 129–137 (1991)
van Rijsbergen, C.J.: Information Retrieval, 2nd edn. Butterworths, London (1979)
Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering techniques. KDD Workshop on Text Mining (2000)
Zhang, Y., Sun, J., Zhang, Y., Zhang, X.: Parallel implementation of CLARANS using PVM. In: Proceedings of the 2004 International Conference on Machine Learning and Cybernetics, vol. 3, pp. 1646–1649 (2004)
Zhao, Y., Karypis, G.: Comparison of agglomerative and partitional document clustering algorithms. Technical Report TR 02-014, Department of Computer Science, University of Minnesota, Minneapolis (2002)
Cite this article
Li, Y., Luo, C. & Chung, S.M. A parallel text document clustering algorithm based on neighbors. Cluster Comput 18, 933–948 (2015). https://doi.org/10.1007/s10586-015-0450-z
DOI: https://doi.org/10.1007/s10586-015-0450-z