SMGKM: An Efficient Incremental Algorithm for Clustering Document Collections

Bagirov, Adil; Seifollahi, Sattar; Piccardi, Massimo; Zare Borzeshi, Ehsan; Kruger, Bernie

doi:10.1007/978-3-031-23804-8_25


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13397)

Included in the following conference series:

International Conference on Computational Linguistics and Intelligent Text Processing


Abstract

Given a large unlabeled document collection, this paper aims to develop an accurate and efficient algorithm for clustering it. Document collections typically contain tens or hundreds of thousands of documents, with thousands or tens of thousands of features (i.e., distinct words). Most existing clustering algorithms struggle to find accurate solutions on such large data sets. The proposed algorithm overcomes this difficulty with an incremental approach, increasing the number of clusters progressively from an initial value of one to a set value. At each iteration, the new candidate cluster is initialized using a partitioning approach which is guaranteed to minimize the objective function. Experiments were carried out on six diverse datasets with different evaluation criteria, and the proposed algorithm outperformed comparable state-of-the-art clustering algorithms in all cases.
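To make the incremental idea concrete, the following is a minimal sketch of a generic incremental k-means scheme that grows the solution from one cluster to a set number. It is not the SMGKM algorithm: the abstract does not specify the partitioning-based initialization, so the sketch swaps in a seeding rule in the style of global k-means (add the data point that yields the largest guaranteed decrease in the sum-of-squares objective). All function names are hypothetical, and dense vectors stand in for the sparse document vectors a real collection would need.

```python
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points: List[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def objective(points: List[Point], centers: List[Point]) -> float:
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def lloyd(points: List[Point], centers: List[Point], max_iter: int = 100):
    """Standard k-means (Lloyd) iterations from the given starting centers."""
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        groups: List[List[Point]] = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[j].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def incremental_kmeans(points: List[Point], k_max: int):
    """Grow the solution from 1 to k_max clusters, one center at a time."""
    centers = [centroid(points)]  # the 1-cluster solution is the global mean
    for _ in range(2, k_max + 1):
        # Squared distance of every point to its nearest existing center.
        d = [min(sq_dist(p, c) for c in centers) for p in points]

        # Assumed seeding rule (global k-means style, not the paper's):
        # the decrease in the objective if x were added as a new center.
        def gain(x: Point) -> float:
            return sum(max(0.0, di - sq_dist(p, x)) for p, di in zip(points, d))

        best = max(points, key=gain)
        centers = lloyd(points, centers + [list(best)])
    return centers


# Toy data: three well-separated pairs of points.
data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0), (21.0, 0.0)]
centers = incremental_kmeans(data, 3)
```

Because each step reuses the centers found for the previous number of clusters, the search for k clusters starts from a solution that is already good for k - 1, which is the usual motivation for incremental schemes of this kind.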

S. Seifollahi—Currently working at Resolution Life (Australia). This work was performed while at the University of Technology Sydney.



Acknowledgement

This project was funded by the Capital Markets Cooperative Research Centre in combination with the Transport Accident Commission of Victoria. We thank our industry partner David Attwood (Lead Research Partnerships). This research received ethics approval from the University of Technology Sydney (UTS HREC REF NO. ETH16-0968).

Author information

Authors and Affiliations

Federation University Australia, Ballarat, VIC, Australia
Adil Bagirov
University of Technology Sydney, Ultimo, NSW, Australia
Sattar Seifollahi & Massimo Piccardi
Capital Markets Cooperative Research Centre, The Rocks, NSW, Australia
Sattar Seifollahi & Ehsan Zare Borzeshi
Transport Accident Commission (TAC), Geelong, VIC, Australia
Bernie Kruger


Corresponding author

Correspondence to Sattar Seifollahi.

Editor information

Editors and Affiliations

Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh


About this paper

Cite this paper

Bagirov, A., Seifollahi, S., Piccardi, M., Zare Borzeshi, E., Kruger, B. (2023). SMGKM: An Efficient Incremental Algorithm for Clustering Document Collections. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2018. Lecture Notes in Computer Science, vol 13397. Springer, Cham. https://doi.org/10.1007/978-3-031-23804-8_25


DOI: https://doi.org/10.1007/978-3-031-23804-8_25
Published: 26 February 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23803-1
Online ISBN: 978-3-031-23804-8
eBook Packages: Computer Science (R0)
