An Optimized K-Means Algorithm of Reducing Cluster Intra-dissimilarity for Document Clustering

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 3739))

Abstract

Because documents are high-dimensional and sparse, clustering similar documents together is a difficult task. K-Means, the most popular document clustering method, suffers from high intra-cluster dissimilarity, i.e. it tends to cluster unrelated documents together. One reason is that all objects (documents) in a cluster exert the same influence on the mean of the cluster. SOM (Self-Organizing Map) is a method that reduces the dimension of data and displays the data in a low-dimensional space, and it has been applied successfully to the clustering of high-dimensional objects. The scalar factor is an important part of SOM. In this paper, an optimized K-Means algorithm is proposed. It introduces the scalar factor from SOM into the means during the K-Means assignment stage to control the influence of new objects on the means. Experiments show that the optimized K-Means algorithm achieves a higher F-Measure and lower clustering Entropy than the standard K-Means algorithm, and thereby reduces the intra-dissimilarity of clusters effectively.
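The abstract does not give the update formula, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes the scalar factor behaves like a SOM learning rate: each time a document is assigned to a cluster, the winning mean moves toward it by a factor alpha(t) that decays linearly over time. The function names (`cosine`, `scalar_factor`, `damped_kmeans`), the linear decay schedule, the default `alpha0=0.5`, and the use of cosine similarity are all illustrative assumptions.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def scalar_factor(t, total, alpha0=0.5):
    """Assumed SOM-style factor: decays linearly from alpha0 toward 0."""
    return alpha0 * (1.0 - t / total)


def damped_kmeans(docs, k, epochs=10, alpha0=0.5, seed=0):
    """Online K-Means sketch: every assignment nudges the winning mean
    toward the document by alpha(t) instead of recomputing a plain average."""
    rng = random.Random(seed)
    means = [list(d) for d in rng.sample(docs, k)]  # k documents as initial means
    total = epochs * len(docs)
    t = 0
    for _ in range(epochs):
        for d in docs:
            # Assignment step: pick the most similar mean.
            j = max(range(k), key=lambda c: cosine(d, means[c]))
            # Damped update: later documents influence the mean less.
            a = scalar_factor(t, total, alpha0)
            means[j] = [m + a * (x - m) for m, x in zip(means[j], d)]
            t += 1
    # Final pass: label every document against the settled means.
    labels = [max(range(k), key=lambda c: cosine(d, means[c])) for d in docs]
    return labels, means


if __name__ == "__main__":
    # Toy term-weight vectors: two obvious topical groups.
    docs = [
        [1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [1.0, 0.1, 0.0],
        [0.0, 0.0, 1.0], [0.0, 0.1, 0.9], [0.1, 0.0, 1.0],
    ]
    labels, means = damped_kmeans(docs, k=2)
    print(labels)
```

Because the factor shrinks as `t` grows, documents seen late in the run shift a mean only slightly. That is one plausible reading of "controlling the influence to the means from new objects"; the paper itself may define the factor differently.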

This work is supported by the National Natural Science Foundation of China (No. 60173051).




Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wang, D., Yu, G., Bao, Y., Zhang, M. (2005). An Optimized K-Means Algorithm of Reducing Cluster Intra-dissimilarity for Document Clustering. In: Fan, W., Wu, Z., Yang, J. (eds) Advances in Web-Age Information Management. WAIM 2005. Lecture Notes in Computer Science, vol 3739. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11563952_81

  • DOI: https://doi.org/10.1007/11563952_81

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-29227-2

  • Online ISBN: 978-3-540-32087-6

  • eBook Packages: Computer Science, Computer Science (R0)
