Skip to main content
Log in

A parallel text document clustering algorithm based on neighbors

  • Published:
Cluster Computing Aims and scope Submit manuscript

Abstract

In this paper, we propose a new parallel algorithm for text document clustering based on the concept of neighbor (Guha et al. in Inf Syst 25(5):345–366, 2000). If two documents are similar enough, they are considered as neighbors of each other. The new algorithm is named parallel k-means based on neighbors (PKBN), and it is a parallel version of sequential k-means based on neighbors (SKBN) that we proposed in Luo et al. (Data Knowl Eng 68(11):1271–1288, 2009). PKBN fully exploits the data-parallelism of SKBN and adopts a new parallel pair-generating method to build the neighbor matrix. Our new parallel pair-generating method causes less communication overhead between processors than existing methods. PKBN is designed for message-passing multiprocessor systems and is implemented on a cluster of Linux workstations to analyze its performance. Our experimental results on real-life data sets demonstrate that PKBN is very efficient and has good scalability with respect to the number of processors and the size of data set.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10

Similar content being viewed by others

References

  1. Aboutabl, A.E., Elsayed, M.N.: A novel parallel algorithm for clustering documents based on the hierarchical agglomerative approach. Int. J. Comput. Sci. Inf. Technol. (IJCSIT). 3(2), 152–163 (2011)

  2. Bobda, C., Steenbock, N.: Singular value decomposition on distributed reconfigurable systems. In: Proceedings of the 12th International Workshop on Rapid System Prototyping, pp. 38–43 (2001)

  3. Brent, R.P., Luk, F.T.: The solution of singular-value and symmetric eigen-value problems on multiprocessor arrays. SIAM J. Sci. Stat. Comput. 6, 69–84 (1985)

    Article  MATH  MathSciNet  Google Scholar 

  4. Cao, Z., Zhou, Y. : Parallel text clustering based on MapReduce. In: Proceedings of the 2nd International Conference on Cloud and Green Computing, pp. 226–229 (2012)

  5. Cutting, D.R., Karger, D.R., Pedersen, J.O., Tukey, J.W.: Scatter/gather: a cluster-based approach to browsing large document collections. In: Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 318–329 (1992)

  6. Dhillon, I.S., Modha, D.S.: A data-clustering algorithm on distributed memory multiprocessors. In: Large-Scale Parallel Data Mining, LNCS, vol. 1759, pp. 245–260. Springer, Heidelberg (2000)

  7. Forman, G., Zhang, B.: Distributed data clustering can be efficient and exact. SIGKDD Explor. Newsl. 2(2), 34–38 (2000)

    Article  Google Scholar 

  8. Garey, M.R., Johnson, D.S., Witsenhausen, H.S.: Complexity of the generalized Lloyd–Max problem. IEEE Trans. Inf. Theory 28(2), 256–257 (1982)

    Article  MathSciNet  Google Scholar 

  9. Garg, A., Mangla, A., Gupta, N., Bhatnagar, V.: PBIRCH: a scalable parallel clustering algorithm for incremental data. In: Proceedings of the International Database Engineering and Applications Symposium, pp. 315–316 (2006)

  10. Guha, S., Rastogi, R., Shim, K.: ROCK: a robust clustering algorithm for categorical attributes. Inf. Syst. 25(5), 345–366 (2000)

    Article  Google Scholar 

  11. Hartigan, J.A.: Clustering Algorithms. Wiley, New York (1975)

    MATH  Google Scholar 

  12. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs (1988)

    MATH  Google Scholar 

  13. Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York (1990)

    Book  Google Scholar 

  14. Larsen, B., Aone, C.: Fast and effective text mining using linear-time document clustering. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 16–22 (1999)

  15. Lewis, D., Yang, Y., Rose, T., Li, F.: RCV1: a new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, 361–397 (2004). Information about RCV1. http://www.daviddlewis.com/resources/testcollections/rcv1/ Accessed 20 Oct 2014

  16. Li, X., Fang, Z.: Parallel clustering algorithms. Parallel Comput. 11(3), 275–290 (1989)

    Article  MATH  MathSciNet  Google Scholar 

  17. Li, Y., Chung, S.M.: Parallel bisecting K-means with prediction clustering algorithm. J. Supercomput. 39(1), 19–37 (2007)

    Article  MATH  Google Scholar 

  18. Li, Y., Chung, S.M., Holt, J.D.: Text document clustering based on frequent word meaning sequences. Data Knowl. Eng. 64(1), 381–404 (2008)

  19. Li, Y., Luo, C., Chung, S.M.: Text clustering with feature selection by using statistical data. IEEE Trans Knowl. Data Eng. 20(5), 641–652 (2008)

    Article  Google Scholar 

  20. Liu, G., Wang, Y., Zhao, T., Li, D.: Research on the parallel text clustering algorithm based on the semantic tree. In: Proceedings of the 6th International Conference on Computer Sciences and Convergence Information Technology, pp. 400–403 (2011)

  21. Luo, C., Li, Y., Chung, S.M.: Text document clustering based on neighbors. Data Knowl. Eng. 68(11), 1271–1288 (2009)

    Article  Google Scholar 

  22. Mogill, J.A., Haglin, D.J.: Toward parallel document clustering. In: Proceedings of the IEEE International Symposium on Parallel and Distributed Processing (IPDPS) Workshops and PhD Forum, pp. 1700–1709 (2011)

  23. Moore, A.W.: The anchors hierarchy: using the triangle inequality to survive high dimensional data. In: Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence, pp. 397–405 (2000)

  24. Olson, C.F.: Parallel algorithms for hierarchical clustering. Parallel Comput. 21, 1313–1325 (1995)

    Article  MATH  MathSciNet  Google Scholar 

  25. Ordonez, C., Omiecinski, E.: Efficient disk-based K-means clustering for relational databases. IEEE Trans. Knowl. Data Eng. 16(8), 909–921 (2004)

    Article  Google Scholar 

  26. Ranka, S., Sahni, S.: Clustering on a hypercube multicomputer. IEEE Trans. Parallel Distrib. Syst. 2(2), 129–137 (1991)

    Article  Google Scholar 

  27. van Rijsbergen, C.J.: Information Retrieval, 2nd edn. Buttersworth, London (1979)

    Google Scholar 

  28. Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering techniques. KDD Workshop on Text Mining (2000)

  29. Zhang, Y., Sun, J., Zhang, Y., Zhang, X.: Parallel implementation of CLARANS using PVM. In: Proceedings of the 2004 International Conference on Machine Learning and Cybernetics, vol. 3, pp. 1646–1649 (2004)

  30. Zhao, Y., Karypis, G.: Comparison of agglomerative and partitional document clustering algorithms. Technical Report# TR 02-014, Department of Computer Science, University of Minnesota, Minneapolis (2002)

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Soon M. Chung.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Li, Y., Luo, C. & Chung, S.M. A parallel text document clustering algorithm based on neighbors. Cluster Comput 18, 933–948 (2015). https://doi.org/10.1007/s10586-015-0450-z

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10586-015-0450-z

Keywords

Navigation