ABSTRACT
We address the problem of clustering numerical vectors together with a network defined over them. The problem setting is essentially equivalent to constrained clustering by Wagstaff and Cardie and semi-supervised clustering by Basu et al., but our focus is on optimally combining two heterogeneous data sources. One application is web pages, which can be represented as numerical vectors of their contents (e.g., term frequencies) and which are hyperlinked to each other, forming a network. Another typical application is genes, whose behavior can be measured numerically and for which a gene network can be obtained from another data source. We first define a new graph clustering measure, which we call normalized network modularity, by balancing the cluster sizes in the original modularity. We then propose a new clustering method that integrates the cost of clustering numerical vectors with the cost of maximizing the normalized network modularity into a single spectral relaxation problem. Our learning algorithm is based on spectral clustering, which turns the problem into an eigenvalue problem, and uses k-means for the final cluster assignments. A significant advantage of our method is that the weight parameter balancing the two costs can be optimized from the given data by choosing the value that gives the minimum total cost. We evaluated the proposed method on a variety of datasets, including synthetic data and real-world data from molecular biology. Experimental results showed that our method is effective for clustering with numerical vectors and a network.
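The abstract does not state the formula for the normalized network modularity, so the following is only a minimal sketch under a stated assumption: that "balancing the cluster size" means dividing each cluster's term of the Newman-Girvan modularity by the number of nodes in that cluster. The function name `modularity_scores` and the toy graph (two triangles joined by one edge) are hypothetical and chosen for illustration; the sketch scores a fixed partition and does not perform the spectral relaxation or the k-means step.

```python
from collections import defaultdict


def modularity_scores(adj, labels):
    """Return (modularity, size_normalized_modularity) for a hard partition.

    adj    : symmetric adjacency matrix as a list of lists (weights >= 0)
    labels : labels[i] is the cluster label of node i

    Assumption (not taken from the source): the normalized variant divides
    each cluster's modularity term by the cluster size |V_k|.
    """
    total = sum(sum(row) for row in adj)  # L(V, V); each undirected edge counts twice
    if total == 0:
        raise ValueError("graph has no edges")

    within = defaultdict(float)  # L(V_k, V_k): edge weight inside cluster k
    degree = defaultdict(float)  # L(V_k, V): total degree of cluster k
    size = defaultdict(int)      # |V_k|: number of nodes in cluster k

    for i, row in enumerate(adj):
        size[labels[i]] += 1
        for j, w in enumerate(row):
            degree[labels[i]] += w
            if labels[i] == labels[j]:
                within[labels[i]] += w

    q = 0.0
    q_norm = 0.0
    for k in size:
        term = within[k] / total - (degree[k] / total) ** 2
        q += term                 # standard Newman-Girvan modularity
        q_norm += term / size[k]  # assumed size-balanced variant
    return q, q_norm


# Toy graph: triangle {0, 1, 2} and triangle {3, 4, 5} joined by edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for a, b in edges:
    adj[a][b] = adj[b][a] = 1

labels = [0, 0, 0, 1, 1, 1]
q, q_norm = modularity_scores(adj, labels)
```

Under this assumption, putting every node in a single cluster scores zero on both measures, and the per-cluster division penalizes partitions whose score comes mostly from one very large cluster.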
- A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286: 509--512, 1999.
- S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. In KDD, pages 59--68, August 2004.
- I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means, spectral clustering and normalized cuts. In KDD, pages 551--556, 2004.
- I. S. Dhillon and S. Sra. Modeling data using directional distributions. Technical Report TR--06--03, University of Texas, Dept. of Computer Sciences, 2003.
- R. Edgar, M. Domrachev, and A. E. Lash. Gene expression omnibus: NCBI gene expression and hybridization array data repository. NAR, 30(1): 207--210, 2002.
- R. Guimera and L. A. Nunes Amaral. Functional cartography of complex metabolic networks. Nature, 433(7028): 895--900, 2005.
- R. Guimera, M. Sales-Pardo, and L. A. N. Amaral. Modularity from fluctuations in random graphs and complex networks. Phys. Rev. E, 70: 025101, 2004.
- L. Hagen and A. B. Kahng. New spectral methods for ratio cut partitioning and clustering. IEEE TCAD, 11: 1074--1085, 1992.
- T. R. Hughes et al. Functional discovery via a compendium of expression profiles. Cell, 102(1): 109--126, 2000.
- M. Kanehisa et al. From genomics to chemical genomics: new developments in KEGG. NAR, 34: D354--357, 2006.
- B. Kulis, S. Basu, I. Dhillon, and R. J. Mooney. Semi-supervised graph clustering: A kernel approach. In ICML, pages 457--464, 2005.
- K. V. Mardia and P. E. Jupp. Directional Statistics. John Wiley & Sons, second edition, 2000.
- M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Phys. Rev. E, 69: 026113, 2004.
- E. Ravasz et al. Hierarchical organization of modularity in metabolic networks. Science, 297(5589): 1551--1555, 2002.
- J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE PAMI, 22(8): 888--905, 2000.
- M. Shiga, I. Takigawa, and H. Mamitsuka. Annotating gene function by combining expression data with a modular gene network. To appear in ISMB, 2007.
- C. Song, S. Havlin, and H. A. Makse. Self-similarity of complex networks. Nature, 433: 392--395, 2005.
- A. Strehl and J. Ghosh. Relationship-based clustering and visualization for high-dimensional data mining. INFORMS Journal on Computing, 15(2): 208--230, 2003.
- O. Troyanskaya et al. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6): 520--525, 2001.
- K. Wagstaff and C. Cardie. Clustering with instance-level constraints. In ICML, pages 1103--1110, 2000.
- D. J. Watts and S. H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393: 440--442, 1998.
- S. White and P. Smyth. A spectral clustering approach to finding communities in graphs. In SDM, pages 76--84, 2005.
- L. F. Wu et al. Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nat. Genet., 31(3): 255--265, 2002.
- S. Zhong and J. Ghosh. A unified framework for model-based clustering. JMLR, 4: 1001--1037, 2003.
- S. Zhong and J. Ghosh. Generative model-based document clustering: A comparative study. KAIS, 8(3): 374--384, 2005.
- X. Zhou, M. C. Kao, and W. H. Wong. Transitive functional annotation by shortest-path analysis of gene expression data. PNAS, 99(20): 12783--12788, 2002.
Index Terms
- A spectral clustering approach to optimally combining numerical vectors with a modular network