The Complexity of Mining Maximal Frequent Subgraphs

Published: 30 December 2014

Abstract

A frequent subgraph of a given collection of graphs is a graph that is isomorphic to a subgraph of at least as many graphs in the collection as a given threshold. Frequent subgraphs generalize frequent itemsets and arise in various contexts, from bioinformatics to the Web. Since the space of frequent subgraphs is typically extremely large, research in graph mining has focused on special types of frequent subgraphs that can be orders of magnitude smaller in number, yet encapsulate the space of all frequent subgraphs. Maximal frequent subgraphs (i.e., the ones not properly contained in any frequent subgraph) constitute the most useful such type.

In this article, we embark on a comprehensive investigation of the computational complexity of mining maximal frequent subgraphs. Our study is carried out by considering the effect of three different parameters: possible restrictions on the class of graphs; a fixed bound on the threshold; and a fixed bound on the number of desired answers. We focus on specific classes of connected graphs: general graphs, planar graphs, graphs of bounded degree, and graphs of bounded treewidth (trees being a special case). Moreover, each class has two variants: that in which the nodes are unlabeled, and that in which they are uniquely labeled. We delineate the complexity of the enumeration problem for each of these variants by determining when it is solvable in (total or incremental) polynomial time and when it is NP-hard. Specifically, for the labeled classes, we show that bounding the threshold yields tractability but, in most cases, bounding the number of answers does not, unless P=NP; an exception is the case of labeled trees, where bounding either of these two parameters yields tractability. The state of affairs turns out to be quite different for the unlabeled classes. The main (and most challenging to prove) result concerns unlabeled trees: we show NP-hardness, even if the input consists of two trees and both the threshold and the number of desired answers are equal to just two. In other words, we establish that the following problem is NP-complete: given two unlabeled trees, do they have more than one maximal subtree in common?
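The definitions above can be made concrete for the uniquely labeled variant with a minimal brute-force sketch. It assumes that each graph is given as a set of undirected edges between uniquely labeled nodes, so that "isomorphic to a subgraph" reduces to edge-set containment; it also ignores patterns consisting of a single node. The function names (`edge`, `support`, `is_connected`, `maximal_frequent_subgraphs`) are hypothetical and chosen for illustration only. The sketch enumerates exponentially many candidates and is not one of the algorithms analyzed in the article.

```python
from itertools import combinations


def edge(u, v):
    """Undirected edge between two uniquely labeled nodes."""
    return frozenset((u, v))


def support(pattern, graphs):
    """Number of graphs in the collection that contain every edge of `pattern`."""
    return sum(1 for g in graphs if pattern <= g)


def is_connected(edges):
    """True if the given non-empty edge set forms a single connected component."""
    edges = list(edges)
    if not edges:
        return False
    seen = set(edges[0])
    changed = True
    while changed:
        changed = False
        for e in edges:
            # Grow the reached node set along edges that touch it.
            if e & seen and not e <= seen:
                seen |= e
                changed = True
    return all(e <= seen for e in edges)


def maximal_frequent_subgraphs(graphs, threshold):
    """Brute force: keep frequent connected edge sets with no frequent proper superset."""
    all_edges = set().union(*graphs)
    ordered = sorted(all_edges, key=sorted)  # deterministic enumeration order
    frequent = [
        frozenset(c)
        for k in range(1, len(ordered) + 1)
        for c in combinations(ordered, k)
        if is_connected(c) and support(frozenset(c), graphs) >= threshold
    ]
    return [p for p in frequent if not any(p < q for q in frequent)]


# Three small uniquely labeled graphs (paths), with threshold 2.
g1 = {edge("a", "b"), edge("b", "c"), edge("c", "d")}
g2 = {edge("a", "b"), edge("b", "c"), edge("d", "e")}
g3 = {edge("b", "c"), edge("c", "d")}
result = maximal_frequent_subgraphs([g1, g2, g3], threshold=2)
```

In this example the paths a-b-c and b-c-d are each contained in two of the three graphs, while their union a-b-c-d occurs only in `g1`, so both paths are maximal frequent subgraphs. For the unlabeled classes this containment shortcut no longer applies, since the check becomes a genuine subgraph isomorphism test.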

Published in

ACM Transactions on Database Systems, Volume 39, Issue 4
Invited Articles Issue, SIGMOD 2013, PODS 2013 and ICDT 2013
December 2014, 341 pages
ISSN: 0362-5915
EISSN: 1557-4644
DOI: 10.1145/2691190

Copyright © 2014 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 30 December 2014
• Accepted: 1 February 2014
• Received: 1 October 2013