Abstract
The goal of any clustering algorithm that produces flat partitions of data is to find both the optimal clustering solution and the optimal number of clusters. One natural way to reach this goal without requiring parameters is to involve a validity index in the clustering process, which can lead to an objective selection of the optimal number of clusters. In this chapter, we make two main contributions. First, since validity indices have mostly been studied on two- or three-dimensional datasets, we evaluate them in real-world applications: document and word clustering. Second, we propose a new context-aware method that aims to enhance the use of validity indices as stopping criteria in agglomerative algorithms. Experimental results show that the method is a step forward in using validity indices more reliably as stopping criteria.
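To make the general idea concrete, the following is a minimal sketch of how a validity index can guide the choice of the number of clusters in an agglomerative algorithm. It is not the chapter's context-aware method: the Dunn index, average linkage, the toy 2D points, and all function names are assumptions chosen for illustration, and the sketch scores every level of the hierarchy and keeps the best one instead of stopping early.

```python
from itertools import combinations
from math import dist


def dunn_index(clusters):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    separation = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    diameter = max(
        (dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    # Degenerate case (no cluster has any spread): treated as the worst
    # score, which is an assumption of this sketch.
    return separation / diameter if diameter > 0 else 0.0


def average_linkage(a, b):
    """Mean pairwise distance between the members of two clusters."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))


def agglomerate_with_index(points):
    """Merge the closest clusters step by step and keep the best-scoring level."""
    clusters = [[p] for p in points]
    best_score, best_partition = float("-inf"), None
    while len(clusters) > 2:
        # Find the pair of clusters with the smallest average-linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        # Score the new partition with the validity index.
        score = dunn_index(clusters)
        if score > best_score:
            best_score, best_partition = score, [list(c) for c in clusters]
    return best_partition, best_score


# Hypothetical toy data: three well-separated groups of 2D points.
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]
partition, score = agglomerate_with_index(points)
print(len(partition), score)
```

Scanning the whole hierarchy avoids committing to the first level at which the index degrades, at the cost of evaluating the index after every merge.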
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
El Sayed, A., Hacid, H., Zighed, D. (2009). Exploring Validity Indices for Clustering Textual Data. In: Zighed, D.A., Tsumoto, S., Ras, Z.W., Hacid, H. (eds) Mining Complex Data. Studies in Computational Intelligence, vol 165. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88067-7_16
DOI: https://doi.org/10.1007/978-3-540-88067-7_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88066-0
Online ISBN: 978-3-540-88067-7
eBook Packages: Engineering, Engineering (R0)