Bipartite isoperimetric graph partitioning for data co-clustering

Data Mining and Knowledge Discovery

Abstract

Data co-clustering refers to the problem of simultaneously clustering two data types. Typically, the data is stored in a contingency or co-occurrence matrix C whose rows and columns represent the two data types to be co-clustered. An entry C_ij of the matrix signifies the relation between the data type represented by row i and the one represented by column j. Co-clustering is the problem of deriving sub-matrices from the larger data matrix by simultaneously clustering its rows and columns. In this paper, we present a novel graph theoretic approach to data co-clustering. The two data types are modeled as the two sets of vertices of a weighted bipartite graph. We then propose the Isoperimetric Co-clustering Algorithm (ICA), a new method for partitioning the bipartite graph. ICA requires only the solution of a sparse system of linear equations, instead of the eigenvalue or SVD problem solved by the popular spectral co-clustering approach. Our theoretical analysis and extensive experiments on publicly available datasets demonstrate the advantages of ICA over other approaches in terms of the quality, efficiency and stability of the bipartite graph partition.
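
The abstract does not spell out the steps of ICA, so the following is only a minimal sketch of how an isoperimetric bipartition of the bipartite graph might look, not the authors' implementation. It assumes the general isoperimetric recipe: build the graph Laplacian L = D - W from C, fix one "ground" vertex at zero, solve the reduced sparse system L0 x0 = d0, and threshold the solution at the cut with the lowest isoperimetric ratio. The function name `isoperimetric_cocluster`, the choice of the largest-degree vertex as ground, the conjugate gradient solver, and the symmetric ratio cut / min(vol(S), vol(complement)) are all assumptions made for illustration. The graph is assumed to be connected, with every row and column having at least one positive entry.

```python
from typing import Dict, List, Tuple


def isoperimetric_cocluster(C: List[List[float]]) -> Tuple[List[int], List[int]]:
    """Split rows and columns of a nonnegative matrix C into two co-clusters.

    Illustrative sketch only; assumes the bipartite graph of C is connected.
    Returns (row_labels, col_labels) with labels in {0, 1}.
    """
    m, n = len(C), len(C[0])
    N = m + n
    # Vertex i < m is row i; vertex m + j is column j; edge weight is C[i][j].
    adj: List[Dict[int, float]] = [dict() for _ in range(N)]
    for i in range(m):
        for j in range(n):
            if C[i][j] > 0:
                adj[i][m + j] = adj[m + j][i] = float(C[i][j])
    deg = [sum(nb.values()) for nb in adj]

    # Assumption: ground the vertex of largest weighted degree (x[g] stays 0).
    g = max(range(N), key=lambda v: deg[v])

    def reduced_laplacian(x: List[float]) -> List[float]:
        """Apply L0, i.e. D - W with the ground row and column removed."""
        y = [deg[v] * x[v] - sum(w * x[u] for u, w in adj[v].items())
             for v in range(N)]
        y[g] = 0.0
        return y

    def dot(a: List[float], b: List[float]) -> float:
        return sum(ai * bi for ai, bi in zip(a, b))

    # Conjugate gradient on the sparse, symmetric positive definite system
    # L0 x0 = d0 (right-hand side is the degree vector without the ground).
    x = [0.0] * N
    r = [0.0 if v == g else deg[v] for v in range(N)]
    p = r[:]
    rs = dot(r, r)
    tol = 1e-20 * rs
    for _ in range(10 * N):
        if rs <= tol:
            break
        Ap = reduced_laplacian(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new

    # Sweep over thresholds of x and keep the cut with the lowest ratio
    # cut(S) / min(vol(S), vol(complement of S)).
    order = sorted(range(N), key=lambda v: x[v])
    total = sum(deg)
    in_s = [False] * N
    cut = vol = 0.0
    best_ratio, best_k = float("inf"), 1
    for k, v in enumerate(order[:-1], start=1):
        # Adding v to S adds its edges to the cut, minus twice those into S.
        cut += deg[v] - 2.0 * sum(w for u, w in adj[v].items() if in_s[u])
        vol += deg[v]
        in_s[v] = True
        ratio = cut / min(vol, total - vol)
        if ratio < best_ratio:
            best_ratio, best_k = ratio, k

    side = [0] * N
    for v in order[:best_k]:
        side[v] = 1
    return side[:m], side[m:]


if __name__ == "__main__":
    # Two dense 2x2 blocks joined by one weak entry (row 1, column 2).
    C = [[5, 4, 0, 0],
         [4, 5, 1, 0],
         [0, 0, 5, 4],
         [0, 0, 4, 5]]
    print(isoperimetric_cocluster(C))
```

On this toy matrix the sweep selects the cut through the single weak entry, so rows 0 and 1 are grouped with columns 0 and 1, and rows 2 and 3 with columns 2 and 3. The point of the sketch is the contrast drawn in the abstract: the only numerical work is one sparse linear solve, with no eigenvector or SVD computation.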

Author information

Corresponding author

Correspondence to Manjeet Rege.

Additional information

Responsible editor: Charu Aggarwal.

About this article

Cite this article

Rege, M., Dong, M. & Fotouhi, F. Bipartite isoperimetric graph partitioning for data co-clustering. Data Min Knowl Disc 16, 276–312 (2008). https://doi.org/10.1007/s10618-008-0091-4

  • DOI: https://doi.org/10.1007/s10618-008-0091-4
