An Optimized K-Means Algorithm of Reducing Cluster Intra-dissimilarity for Document Clustering

Wang, Daling; Yu, Ge; Bao, Yubin; Zhang, Meng

doi:10.1007/11563952_81

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3739)

Abstract

Because document vectors are high-dimensional and sparse, clustering similar documents together is difficult. K-Means, the most popular document clustering method, suffers from cluster intra-dissimilarity: it tends to place unrelated documents in the same cluster. One reason is that every object (document) in a cluster has the same influence on the cluster mean. The Self-Organizing Map (SOM) reduces the dimensionality of data and displays it in a low-dimensional space, and it has been applied successfully to clustering high-dimensional objects. The scalar factor is an important part of SOM. In this paper, an optimized K-Means algorithm is proposed. It introduces the scalar factor from SOM into the mean update during the K-Means assignment stage, so as to control how strongly each new object influences the means. Experiments show that the optimized K-Means algorithm achieves a higher F-Measure and a lower clustering Entropy than the standard K-Means algorithm, and thereby reduces the intra-dissimilarity of clusters effectively.
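
The idea of damping a new object's influence on a cluster mean can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not give the form of the scalar factor, so the sketch assumes a SOM-style learning rate that decays linearly from a starting value. The names scalar_factor, damped_kmeans, init_means and alpha0 are hypothetical, and cosine similarity is an assumed choice of document similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def scalar_factor(step, total_steps, alpha0=0.5):
    """Assumed SOM-style learning rate: decays linearly from alpha0 towards 0."""
    return alpha0 * (1.0 - step / total_steps)


def nearest_mean(doc, means):
    """Index of the mean most similar to doc (first index wins on ties)."""
    return max(range(len(means)), key=lambda j: cosine_similarity(doc, means[j]))


def damped_kmeans(docs, init_means, epochs=5, alpha0=0.5):
    """Online K-Means variant with a damped mean update (illustrative only).

    Each time a document is assigned to its most similar mean, that mean
    moves towards the document by the fraction given by scalar_factor,
    so later assignments change the mean less than earlier ones.
    """
    means = [list(m) for m in init_means]  # copy so the caller's lists are untouched
    total_steps = epochs * len(docs)
    step = 0
    for _ in range(epochs):
        for doc in docs:
            best = nearest_mean(doc, means)
            alpha = scalar_factor(step, total_steps, alpha0)
            # Damped update: mean <- mean + alpha * (doc - mean)
            means[best] = [m + alpha * (x - m) for m, x in zip(means[best], doc)]
            step += 1
    # Final assignment against the fixed means.
    labels = [nearest_mean(doc, means) for doc in docs]
    return labels, means


if __name__ == "__main__":
    # Toy term-frequency vectors: two documents per topic.
    docs = [[3, 0, 1], [2, 0, 0], [0, 3, 1], [0, 2, 0]]
    labels, means = damped_kmeans(docs, init_means=[docs[0], docs[2]])
    print(labels)
```

In standard batch K-Means the mean is the unweighted average of all members, so every document counts equally. Here the weight of an assignment depends on when it happens, which is one simple way to realize the "controlled influence" described in the abstract.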

This work is supported by the National Natural Science Foundation of China (No. 60173051).

Author information

Authors and Affiliations

School of Information Science and Engineering, Northeastern University, Shenyang, 110004, P.R.China
Daling Wang, Ge Yu, Yubin Bao & Meng Zhang

Editor information

Editors and Affiliations

University of Edinburgh & Bell Laboratories
Wenfei Fan
College of Computer Science, Zhejiang University, 310027, Hangzhou, Zhejiang, China
Zhaohui Wu
Dept. of E. I. E, Huazhong University of Science and Technology, Wuhan, China
Jun Yang

Cite this paper

Wang, D., Yu, G., Bao, Y., Zhang, M. (2005). An Optimized K-Means Algorithm of Reducing Cluster Intra-dissimilarity for Document Clustering. In: Fan, W., Wu, Z., Yang, J. (eds) Advances in Web-Age Information Management. WAIM 2005. Lecture Notes in Computer Science, vol 3739. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11563952_81

DOI: https://doi.org/10.1007/11563952_81
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29227-2
Online ISBN: 978-3-540-32087-6
eBook Packages: Computer Science, Computer Science (R0)
