Abstract
A frequent subgraph of a given collection of graphs is a graph that is isomorphic to a subgraph of at least as many graphs in the collection as a given threshold. Frequent subgraphs generalize frequent itemsets and arise in various contexts, from bioinformatics to the Web. Since the space of frequent subgraphs is typically extremely large, research in graph mining has focused on special types of frequent subgraphs that can be orders of magnitude smaller in number, yet encapsulate the space of all frequent subgraphs. Maximal frequent subgraphs (i.e., the ones not properly contained in any frequent subgraph) constitute the most useful such type.
In this article, we embark on a comprehensive investigation of the computational complexity of mining maximal frequent subgraphs. Our study is carried out by considering the effect of three different parameters: possible restrictions on the class of graphs; a fixed bound on the threshold; and a fixed bound on the number of desired answers. We focus on specific classes of connected graphs: general graphs, planar graphs, graphs of bounded degree, and graphs of bounded treewidth (trees being a special case). Moreover, each class has two variants: that in which the nodes are unlabeled, and that in which they are uniquely labeled. We delineate the complexity of the enumeration problem for each of these variants by determining when it is solvable in (total or incremental) polynomial time and when it is NP-hard. Specifically, for the labeled classes, we show that bounding the threshold yields tractability but, in most cases, bounding the number of answers does not, unless P=NP; an exception is the case of labeled trees, where bounding either of these two parameters yields tractability. The state of affairs turns out to be quite different for the unlabeled classes. The main (and most challenging to prove) result concerns unlabeled trees: we show NP-hardness, even if the input consists of two trees and both the threshold and the number of desired answers are equal to just two. In other words, we establish that the following problem is NP-complete: given two unlabeled trees, do they have more than one maximal subtree in common?
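To make the definitions concrete, here is a minimal brute-force sketch for the uniquely labeled case. It is an illustration only, not the algorithm of the article. It assumes that each graph is given as a set of edges between node labels, so that a pattern occurs in a graph exactly when the pattern's edge set is contained in the graph's edge set, and it considers only connected patterns with at least one edge. The function names (`normalize`, `is_connected`, `support`, `maximal_frequent_subgraphs`) are hypothetical, and the running time is exponential in the number of distinct edges.

```python
from itertools import combinations


def normalize(edges):
    """Store each undirected edge as a sorted tuple so (a, b) equals (b, a)."""
    return frozenset(tuple(sorted(edge)) for edge in edges)


def is_connected(edges):
    """Return True if the nonempty edge set forms one connected component."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    if not adjacency:
        return False
    start = next(iter(adjacency))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbor in adjacency[node] - seen:
            seen.add(neighbor)
            stack.append(neighbor)
    return len(seen) == len(adjacency)


def support(pattern, graphs):
    """Count the graphs that contain every edge of the pattern."""
    return sum(1 for graph in graphs if pattern <= graph)


def maximal_frequent_subgraphs(graphs, threshold):
    """Brute force: enumerate connected edge subsets, keep the frequent
    ones, then discard any that is properly contained in another."""
    graphs = [normalize(g) for g in graphs]
    universe = sorted(set().union(*graphs))
    frequent = []
    for size in range(1, len(universe) + 1):
        for subset in combinations(universe, size):
            pattern = frozenset(subset)
            if is_connected(pattern) and support(pattern, graphs) >= threshold:
                frequent.append(pattern)
    return [p for p in frequent if not any(p < q for q in frequent)]


# Hypothetical input: three uniquely labeled graphs, threshold 2.
example_graphs = [
    {("a", "b"), ("b", "c"), ("c", "d")},
    {("a", "b"), ("b", "c"), ("d", "e")},
    {("b", "c"), ("c", "d"), ("d", "e")},
]
for pattern in maximal_frequent_subgraphs(example_graphs, 2):
    print(sorted(pattern))
```

In this example the single edge between `b` and `c` is frequent but not maximal, because it is properly contained in two larger frequent patterns; the edge between `d` and `e` is maximal because no frequent connected pattern extends it.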