Abstract
Data co-clustering refers to the problem of simultaneously clustering two data types. Typically, the data is stored in a contingency or co-occurrence matrix C, where the rows and columns of the matrix represent the data types to be co-clustered. An entry C_ij of the matrix signifies the relation between the data type represented by row i and the data type represented by column j. Co-clustering is the problem of deriving sub-matrices from the larger data matrix by simultaneously clustering its rows and columns. In this paper, we present a novel graph theoretic approach to data co-clustering. The two data types are modeled as the two sets of vertices of a weighted bipartite graph. We then propose the Isoperimetric Co-clustering Algorithm (ICA), a new method for partitioning the bipartite graph. ICA requires only the solution of a sparse system of linear equations, instead of the eigenvalue or SVD problem solved in the popular spectral co-clustering approach. Our theoretical analysis and extensive experiments performed on publicly available datasets demonstrate the advantages of ICA over other approaches in terms of the quality, efficiency, and stability of the bipartite graph partitioning.
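The abstract does not spell out the steps of ICA, so the following is only a minimal sketch of the general idea: build the bipartite graph from C, solve one sparse linear system, and threshold the solution. It follows the generic isoperimetric partitioning recipe (ground one vertex, solve the reduced Laplacian system, sweep thresholds). The function name `isoperimetric_cocluster`, the choice of ground vertex, the right-hand side, and the threshold sweep are all assumptions made for illustration, not details confirmed by the paper. The graph is assumed to be connected with no all-zero rows or columns.

```python
import numpy as np
from scipy.sparse import bmat, csr_matrix, diags
from scipy.sparse.linalg import spsolve


def isoperimetric_cocluster(C):
    """Split rows and columns of a nonnegative matrix C into two co-clusters.

    Hypothetical helper: a sketch of isoperimetric bipartitioning, not the
    authors' reference implementation.
    """
    C = csr_matrix(C, dtype=float)
    m, n = C.shape

    # Bipartite adjacency: vertices 0..m-1 are rows, m..m+n-1 are columns.
    W = bmat([[None, C], [C.T, None]], format="csr")
    d = np.asarray(W.sum(axis=1)).ravel()   # weighted degrees
    L = (diags(d) - W).tocsr()              # graph Laplacian

    # Assumption: ground the vertex with the largest weighted degree.
    g = int(np.argmax(d))
    keep = np.flatnonzero(np.arange(m + n) != g)
    L0 = L[keep][:, keep].tocsc()           # Laplacian with ground removed

    # A single sparse solve takes the place of an eigenvector or SVD step.
    x = np.zeros(m + n)
    x[keep] = spsolve(L0, d[keep])

    # Sweep thresholds over the sorted potentials; keep the cut with the
    # smallest ratio cut(S) / min(vol(S), vol(complement of S)).
    order = np.argsort(-x)
    total = d.sum()
    best_ratio, best_k = np.inf, 1
    s = np.zeros(m + n)
    for k in range(1, m + n):
        s[order[k - 1]] = 1.0
        cut = s @ (L @ s)
        vol = d @ s
        ratio = cut / min(vol, total - vol)
        if ratio < best_ratio:
            best_ratio, best_k = ratio, k

    labels = np.zeros(m + n, dtype=int)
    labels[order[:best_k]] = 1
    return labels[:m], labels[m:]


if __name__ == "__main__":
    # Toy co-occurrence matrix with two blocks joined by weak links.
    C = np.array([[5, 5, 0.1, 0],
                  [5, 5, 0, 0],
                  [0, 0, 5, 5],
                  [0, 0.1, 5, 5]])
    row_labels, col_labels = isoperimetric_cocluster(C)
    print(row_labels, col_labels)
```

The threshold sweep here recomputes the cut from scratch at each step, which is fine for a toy matrix but would be replaced by an incremental update on large data. More than two co-clusters could be obtained by applying the same bipartition recursively, which is again an assumption rather than something the abstract states.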
Responsible editor: Charu Aggarwal.
Cite this article
Rege, M., Dong, M. & Fotouhi, F. Bipartite isoperimetric graph partitioning for data co-clustering. Data Min Knowl Disc 16, 276–312 (2008). https://doi.org/10.1007/s10618-008-0091-4
DOI: https://doi.org/10.1007/s10618-008-0091-4