A parallel text document clustering algorithm based on neighbors

Li, Yanjun; Luo, Congnan; Chung, Soon M.

doi:10.1007/s10586-015-0450-z

Published: 07 April 2015

Volume 18, pages 933–948, (2015)

Cluster Computing

Abstract

In this paper, we propose a new parallel algorithm for text document clustering based on the concept of neighbors (Guha et al. in Inf Syst 25(5):345–366, 2000). Two documents are considered neighbors of each other if they are sufficiently similar. The new algorithm, parallel k-means based on neighbors (PKBN), is a parallel version of the sequential k-means based on neighbors (SKBN) algorithm that we proposed in Luo et al. (Data Knowl Eng 68(11):1271–1288, 2009). PKBN fully exploits the data parallelism of SKBN and adopts a new parallel pair-generating method to build the neighbor matrix; this method incurs less communication overhead between processors than existing methods. PKBN is designed for message-passing multiprocessor systems, and we implemented it on a cluster of Linux workstations to analyze its performance. Our experimental results on real-life data sets demonstrate that PKBN is efficient and scales well with respect to the number of processors and the size of the data set.
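
To make the neighbor concept in the abstract concrete, the following is a minimal sequential sketch of building a neighbor matrix. It is an illustration under stated assumptions, not the PKBN implementation: the abstract does not give the similarity measure or threshold, so cosine similarity, the threshold name `theta`, the convention that a document is its own neighbor, and the function names `cosine` and `neighbor_matrix` are all assumptions made here.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two term-weight vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_u * norm_v)


def neighbor_matrix(docs: List[Sequence[float]], theta: float) -> List[List[int]]:
    """Return an n x n 0/1 matrix m with m[i][j] = 1 when documents i and j
    are neighbors, i.e. their similarity is at least theta (assumed threshold)."""
    n = len(docs)
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        m[i][i] = 1  # assumed convention: each document is its own neighbor
        # Enumerate every unordered pair (i, j) once; the matrix is symmetric.
        for j in range(i + 1, n):
            if cosine(docs[i], docs[j]) >= theta:
                m[i][j] = m[j][i] = 1
    return m


if __name__ == "__main__":
    # Three toy documents over a three-term vocabulary (made-up weights).
    toy_docs = [[1.0, 0.0, 1.0], [1.0, 0.0, 0.9], [0.0, 1.0, 0.0]]
    for row in neighbor_matrix(toy_docs, theta=0.9):
        print(row)
```

The inner loop visits each of the n(n-1)/2 document pairs once. Because each pair is evaluated independently, this is the kind of work that can be distributed across processors; how PKBN actually generates and distributes the pairs is described in the full article, not in this sketch.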

References

Aboutabl, A.E., Elsayed, M.N.: A novel parallel algorithm for clustering documents based on the hierarchical agglomerative approach. Int. J. Comput. Sci. Inf. Technol. (IJCSIT). 3(2), 152–163 (2011)
Bobda, C., Steenbock, N.: Singular value decomposition on distributed reconfigurable systems. In: Proceedings of the 12th International Workshop on Rapid System Prototyping, pp. 38–43 (2001)
Brent, R.P., Luk, F.T.: The solution of singular-value and symmetric eigen-value problems on multiprocessor arrays. SIAM J. Sci. Stat. Comput. 6, 69–84 (1985)
Cao, Z., Zhou, Y.: Parallel text clustering based on MapReduce. In: Proceedings of the 2nd International Conference on Cloud and Green Computing, pp. 226–229 (2012)
Cutting, D.R., Karger, D.R., Pedersen, J.O., Tukey, J.W.: Scatter/gather: a cluster-based approach to browsing large document collections. In: Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 318–329 (1992)
Dhillon, I.S., Modha, D.S.: A data-clustering algorithm on distributed memory multiprocessors. In: Large-Scale Parallel Data Mining, LNCS, vol. 1759, pp. 245–260. Springer, Heidelberg (2000)
Forman, G., Zhang, B.: Distributed data clustering can be efficient and exact. SIGKDD Explor. Newsl. 2(2), 34–38 (2000)
Garey, M.R., Johnson, D.S., Witsenhausen, H.S.: Complexity of the generalized Lloyd–Max problem. IEEE Trans. Inf. Theory 28(2), 256–257 (1982)
Garg, A., Mangla, A., Gupta, N., Bhatnagar, V.: PBIRCH: a scalable parallel clustering algorithm for incremental data. In: Proceedings of the International Database Engineering and Applications Symposium, pp. 315–316 (2006)
Guha, S., Rastogi, R., Shim, K.: ROCK: a robust clustering algorithm for categorical attributes. Inf. Syst. 25(5), 345–366 (2000)
Hartigan, J.A.: Clustering Algorithms. Wiley, New York (1975)
Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs (1988)
Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York (1990)
Larsen, B., Aone, C.: Fast and effective text mining using linear-time document clustering. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 16–22 (1999)
Lewis, D., Yang, Y., Rose, T., Li, F.: RCV1: a new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, 361–397 (2004). Information about RCV1. http://www.daviddlewis.com/resources/testcollections/rcv1/ Accessed 20 Oct 2014
Li, X., Fang, Z.: Parallel clustering algorithms. Parallel Comput. 11(3), 275–290 (1989)
Li, Y., Chung, S.M.: Parallel bisecting K-means with prediction clustering algorithm. J. Supercomput. 39(1), 19–37 (2007)
Li, Y., Chung, S.M., Holt, J.D.: Text document clustering based on frequent word meaning sequences. Data Knowl. Eng. 64(1), 381–404 (2008)
Li, Y., Luo, C., Chung, S.M.: Text clustering with feature selection by using statistical data. IEEE Trans. Knowl. Data Eng. 20(5), 641–652 (2008)
Liu, G., Wang, Y., Zhao, T., Li, D.: Research on the parallel text clustering algorithm based on the semantic tree. In: Proceedings of the 6th International Conference on Computer Sciences and Convergence Information Technology, pp. 400–403 (2011)
Luo, C., Li, Y., Chung, S.M.: Text document clustering based on neighbors. Data Knowl. Eng. 68(11), 1271–1288 (2009)
Mogill, J.A., Haglin, D.J.: Toward parallel document clustering. In: Proceedings of the IEEE International Symposium on Parallel and Distributed Processing (IPDPS) Workshops and PhD Forum, pp. 1700–1709 (2011)
Moore, A.W.: The anchors hierarchy: using the triangle inequality to survive high dimensional data. In: Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence, pp. 397–405 (2000)
Olson, C.F.: Parallel algorithms for hierarchical clustering. Parallel Comput. 21, 1313–1325 (1995)
Ordonez, C., Omiecinski, E.: Efficient disk-based K-means clustering for relational databases. IEEE Trans. Knowl. Data Eng. 16(8), 909–921 (2004)
Ranka, S., Sahni, S.: Clustering on a hypercube multicomputer. IEEE Trans. Parallel Distrib. Syst. 2(2), 129–137 (1991)
van Rijsbergen, C.J.: Information Retrieval, 2nd edn. Butterworths, London (1979)
Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering techniques. KDD Workshop on Text Mining (2000)
Zhang, Y., Sun, J., Zhang, Y., Zhang, X.: Parallel implementation of CLARANS using PVM. In: Proceedings of the 2004 International Conference on Machine Learning and Cybernetics, vol. 3, pp. 1646–1649 (2004)
Zhao, Y., Karypis, G.: Comparison of agglomerative and partitional document clustering algorithms. Technical Report #TR 02-014, Department of Computer Science, University of Minnesota, Minneapolis (2002)

Author information

Authors and Affiliations

Department of Computer and Information Science, Fordham University, Bronx, NY, 10458, USA
Yanjun Li
Teradata Corporation, San Diego, CA, 92127, USA
Congnan Luo
Department of Computer Science and Engineering, Wright State University, Dayton, OH, 45435, USA
Soon M. Chung

Corresponding author

Correspondence to Soon M. Chung.

About this article

Cite this article

Li, Y., Luo, C. & Chung, S.M. A parallel text document clustering algorithm based on neighbors. Cluster Comput 18, 933–948 (2015). https://doi.org/10.1007/s10586-015-0450-z

Received: 25 February 2014
Revised: 21 October 2014
Accepted: 20 March 2015
Published: 07 April 2015
Issue Date: June 2015
DOI: https://doi.org/10.1007/s10586-015-0450-z
