The Complexity of Mining Maximal Frequent Subgraphs

Authors:
Benny Kimelfeld, LogicBlox, Inc., Atlanta, GA
Phokion G. Kolaitis, University of California, Santa Cruz and IBM Research -- Almaden, San Jose, CA

ACM Transactions on Database Systems, Volume 39, Issue 4, Article No. 32, pp. 1–33. https://doi.org/10.1145/2629550

Published: 30 December 2014

Abstract

A frequent subgraph of a given collection of graphs is a graph that is isomorphic to a subgraph of at least as many graphs in the collection as a given threshold. Frequent subgraphs generalize frequent itemsets and arise in various contexts, from bioinformatics to the Web. Since the space of frequent subgraphs is typically extremely large, research in graph mining has focused on special types of frequent subgraphs that can be orders of magnitude smaller in number, yet encapsulate the space of all frequent subgraphs. Maximal frequent subgraphs (i.e., the ones not properly contained in any frequent subgraph) constitute the most useful such type.
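To make the definitions concrete, here is a minimal brute-force sketch in Python for the uniquely labeled case. It rests on assumptions that are not taken from the article: each graph is represented as a set of edges between labeled nodes, so that "is isomorphic to a subgraph of" reduces to plain set containment, and patterns are not required to be connected. The function names `support`, `frequent_patterns`, and `maximal_frequent` are hypothetical, and the enumeration is exponential, so it is meant only as an illustration on tiny inputs.

```python
from itertools import combinations

def support(pattern, graphs):
    """Number of graphs in the collection that contain every edge of the pattern."""
    return sum(1 for g in graphs if pattern <= g)

def frequent_patterns(graphs, threshold):
    """All nonempty edge sets contained in at least `threshold` graphs (brute force)."""
    universe = sorted(set().union(*graphs))
    found = []
    for k in range(1, len(universe) + 1):
        for combo in combinations(universe, k):
            pattern = frozenset(combo)
            if support(pattern, graphs) >= threshold:
                found.append(pattern)
    return found

def maximal_frequent(graphs, threshold):
    """Frequent patterns that are not properly contained in another frequent pattern."""
    frequent = frequent_patterns(graphs, threshold)
    return [p for p in frequent if not any(p < q for q in frequent)]

# Three small uniquely labeled graphs, written as sets of edges (illustrative data).
g1 = frozenset({("a", "b"), ("b", "c"), ("c", "d")})
g2 = frozenset({("a", "b"), ("b", "c"), ("d", "e")})
g3 = frozenset({("a", "b"), ("c", "d"), ("d", "e")})
graphs = [g1, g2, g3]

all_frequent = frequent_patterns(graphs, threshold=2)
maximal = maximal_frequent(graphs, threshold=2)
print(len(all_frequent), "frequent patterns,", len(maximal), "maximal")
```

In this toy collection the maximal patterns are a strict subset of the frequent ones, yet every frequent pattern is contained in some maximal one, which is the sense in which the maximal patterns encapsulate the full space.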

In this article, we embark on a comprehensive investigation of the computational complexity of mining maximal frequent subgraphs. Our study is carried out by considering the effect of three different parameters: possible restrictions on the class of graphs; a fixed bound on the threshold; and a fixed bound on the number of desired answers. We focus on specific classes of connected graphs: general graphs, planar graphs, graphs of bounded degree, and graphs of bounded treewidth (trees being a special case). Moreover, each class has two variants: that in which the nodes are unlabeled, and that in which they are uniquely labeled. We delineate the complexity of the enumeration problem for each of these variants by determining when it is solvable in (total or incremental) polynomial time and when it is NP-hard. Specifically, for the labeled classes, we show that bounding the threshold yields tractability but, in most cases, bounding the number of answers does not, unless P=NP; an exception is the case of labeled trees, where bounding either of these two parameters yields tractability. The state of affairs turns out to be quite different for the unlabeled classes. The main (and most challenging to prove) result concerns unlabeled trees: we show NP-hardness, even if the input consists of two trees and both the threshold and the number of desired answers are equal to just two. In other words, we establish that the following problem is NP-complete: given two unlabeled trees, do they have more than one maximal subtree in common?
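The decision problem stated at the end of the abstract can be illustrated on very small inputs with an exhaustive search. The sketch below is not the authors' construction; it is an exponential-time brute force under the assumptions that each tree is given as an adjacency dictionary and that "subtree" means a connected subgraph. The helper names (`canon`, `subtree_patterns`, `maximal_common_subtrees`) are hypothetical. Given the NP-completeness result, no polynomial-time method is expected for this problem unless P=NP.

```python
from itertools import combinations

def is_connected(nodes, adj):
    """True if the nonempty node set induces a connected subgraph of the tree."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w in nodes and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == nodes

def canon(nodes, adj):
    """Canonical string of the unlabeled tree induced by `nodes` (minimum over all roots)."""
    nodes = set(nodes)
    def rooted(v, parent):
        kids = sorted(rooted(w, v) for w in adj[v] if w in nodes and w != parent)
        return "(" + "".join(kids) + ")"
    return min(rooted(r, None) for r in nodes)

def subtree_patterns(adj):
    """Map each subtree shape (canonical string) to one node set realizing it."""
    reps = {}
    vertices = list(adj)
    for k in range(1, len(vertices) + 1):
        for combo in combinations(vertices, k):
            if is_connected(combo, adj):
                reps.setdefault(canon(combo, adj), frozenset(combo))
    return reps

def maximal_common_subtrees(adj1, adj2):
    """Shapes common to both trees that are not properly contained in another common shape."""
    reps1 = subtree_patterns(adj1)
    common = set(reps1) & set(subtree_patterns(adj2))
    covered = set()
    for shape in common:
        nodes = reps1[shape]
        if len(nodes) < 2:
            continue
        # Deleting a leaf of a common subtree yields a smaller common subtree,
        # which therefore cannot be maximal.
        for v in nodes:
            rest = nodes - {v}
            if is_connected(rest, adj1):
                covered.add(canon(rest, adj1))
    return common - covered

# Tree 1: a degree-3 node "c" with legs of lengths 1, 1 and 3 (illustrative data).
t1 = {"a": ["c"], "b": ["c"], "c": ["a", "b", "d"],
      "d": ["c", "e"], "e": ["d", "f"], "f": ["e"]}
# Tree 2: a path p1..p5 with an extra leaf "q" attached to the middle node.
t2 = {"p1": ["p2"], "p2": ["p1", "p3"], "p3": ["p2", "p4", "q"],
      "p4": ["p3", "p5"], "p5": ["p4"], "q": ["p3"]}

answers = maximal_common_subtrees(t1, t2)
print("more than one maximal common subtree:", len(answers) > 1)
```

For these two trees the search finds two maximal common subtrees, a five-node path and a five-node tree with one branching node, so the answer to the decision problem is yes. The check that looks only one node larger suffices because any common subtree properly containing a pattern also contains a connected extension of that pattern by a single node.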

Index Terms

1. Information systems
  1. Information systems applications
    1. Data mining
2. Mathematics of computing
  1. Discrete mathematics
    1. Graph theory
      1. Graph algorithms

Published in
ACM Transactions on Database Systems Volume 39, Issue 4
Invited Articles Issue, SIGMOD 2013, PODS 2013 and ICDT 2013
December 2014
341 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/2691190
Editor:
Christian S. Jensen
Aalborg University, Denmark
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 30 December 2014
- Accepted: 1 February 2014
- Received: 1 October 2013
Author Tags
Graph mining
enumeration complexity
maximal frequent subgraphs
Qualifiers
- research-article
- Research
- Refereed
Article Metrics
- Total Citations: 16
- Total Downloads: 362
- Downloads (last 12 months): 8
- Downloads (last 6 weeks): 0
