Abstract
Many practical problems require not only the sample labels produced by a clustering algorithm but also recognition of the far-near relations of clusters. When a high-dimensional data set contains many clusters, clustering visualization methods based on dimension reduction often show some clusters as overlapping, interlacing, or pushed apart; as a result, the far-near relations of some clusters are displayed wrongly or cannot be distinguished. Existing inter-cluster distance methods cannot determine whether two clusters are far apart or near. The geometric double-entity model (GDEM) method is proposed to describe far-near relations of clusters, and measures such as the relative border distance, the absolute border distance, and the region dense degree are designed to quantify how far or near two clusters are. GDEM considers both the absolute distance between the nearest sample sets of two clusters and the dense degrees of their border regions, and it can accurately uncover far-near relations of clusters in a high-dimensional space, especially under the difficult condition described above. Experimental results on four real data sets show that the proposed method effectively recognizes far-near relations of clusters, whereas the conventional methods do not.
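The abstract names the border-based measures without giving their formulas, so the following Python sketch is only an illustration under assumed definitions, not the GDEM formulas from the paper. It assumes that the "nearest sample sets" are the `k` samples of each cluster closest to the other cluster, that the absolute border distance is the mean distance between those two sets, and that the density of a border region can be approximated by the mean nearest-neighbour spacing inside it. All function names and the parameter `k` are hypothetical.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def border_sets(a, b, k):
    """Assumed 'nearest sample sets': the k samples of each cluster
    that lie closest to the other cluster."""
    d = pairwise_dist(a, b)
    a_idx = np.argsort(d.min(axis=1))[:k]  # rows of a nearest to b
    b_idx = np.argsort(d.min(axis=0))[:k]  # rows of b nearest to a
    return a[a_idx], b[b_idx]

def border_spacing(border):
    """Assumed density proxy: mean nearest-neighbour distance inside
    one border set (small spacing means a dense border region)."""
    d = pairwise_dist(border, border)
    np.fill_diagonal(d, np.inf)  # ignore each sample's distance to itself
    return d.min(axis=1).mean()

def absolute_border_distance(a, b, k=5):
    """Assumed definition: mean distance between the two border sets."""
    ba, bb = border_sets(a, b, k)
    return pairwise_dist(ba, bb).mean()

def relative_border_distance(a, b, k=5):
    """Assumed definition: border gap divided by the average spacing
    of the two border regions, so the gap is judged against local density."""
    ba, bb = border_sets(a, b, k)
    gap = pairwise_dist(ba, bb).mean()
    spacing = 0.5 * (border_spacing(ba) + border_spacing(bb))
    return gap / spacing

# Synthetic 50-dimensional clusters (illustrative data, not from the paper).
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(40, 50))
b_near = rng.normal(0.0, 1.0, size=(40, 50)) + 0.5
b_far = rng.normal(0.0, 1.0, size=(40, 50)) + 5.0

print("near pair:", relative_border_distance(a, b_near))
print("far pair: ", relative_border_distance(a, b_far))
```

Dividing the gap by the local spacing is one plausible way to let the density of the border regions influence the verdict: a gap that is large compared with the spacing of the border samples suggests two clusters that are far apart, while a gap comparable to that spacing suggests near clusters.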
Cite this article
Wang, K., Yan, X. & Chen, L. Geometric double-entity model for recognizing far-near relations of clusters. Sci. China Inf. Sci. 54, 2040–2050 (2011). https://doi.org/10.1007/s11432-011-4386-5
DOI: https://doi.org/10.1007/s11432-011-4386-5