Mining frequent subgraphs from tremendous amount of small graphs using MapReduce

Zhe Peng¹,
Tongtong Wang¹,
Wei Lu¹,
Hao Huang²,
Xiaoyong Du¹,
Feng Zhao³ &
…
Anthony K. H. Tung³


Abstract

Frequent subgraph mining from a tremendous number of small graphs is a primitive operation for many data mining applications. Existing approaches mainly focus on centralized systems and suffer from scalability issues. Given the increasing volume of graph data and the fact that mining frequent subgraphs is a memory-intensive task, it is difficult to tackle this problem efficiently on a single centralized machine. In this paper, we therefore propose an efficient and scalable solution, called MRFSE, using MapReduce. MRFSE adopts a breadth-first search strategy to extract frequent subgraphs iteratively, i.e., all frequent subgraphs with \(i+1\) edges are generated from the frequent subgraphs with \(i\) edges at the \(i\)th iteration. In our design, existing frequent subgraph mining techniques for centralized systems can be easily extended and integrated. More importantly, new frequent subgraphs are generated without performing any isomorphism test, which is costly yet required in existing frequent subgraph mining techniques. In addition, various optimization techniques are proposed to further reduce the communication and I/O costs. Extensive experiments conducted on our in-house clusters demonstrate the superiority of our proposed solution in terms of both scalability and efficiency.
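
To make the level-wise strategy concrete, the following is a minimal in-memory sketch of breadth-first frequent subgraph mining organized as map and reduce phases. It is not the MRFSE implementation, whose details are not given in this abstract. All names (`edge`, `map_phase`, `reduce_phase`, `mine`, `DATABASE`) are hypothetical, and the sketch relies on a strong simplifying assumption that is not from the paper: vertex labels are unique within each graph, so a pattern is fully identified by its set of labeled edges and no isomorphism test arises.

```python
from collections import defaultdict

def edge(u, v):
    """Normalize an undirected edge between two vertex labels."""
    return (u, v) if u <= v else (v, u)

def map_phase(gid, graph_edges, frequent):
    """Emit (candidate pattern, graph id) pairs for one input graph.

    ASSUMPTION: labels are unique per graph, so a pattern is just a
    frozenset of edges and "pattern occurs in graph" is a subset check.
    """
    if frequent is None:
        # First iteration: every single edge is a 1-edge candidate.
        for e in graph_edges:
            yield frozenset([e]), gid
        return
    for pattern in frequent:
        if not pattern <= graph_edges:
            continue  # this frequent i-edge pattern is absent from the graph
        vertices = {v for e in pattern for v in e}
        # Grow the pattern by one adjacent edge so it stays connected.
        for e in graph_edges - pattern:
            if e[0] in vertices or e[1] in vertices:
                yield pattern | {e}, gid

def reduce_phase(pairs, min_support):
    """Group by pattern and keep those supported by enough distinct graphs."""
    supporters = defaultdict(set)
    for pattern, gid in pairs:
        supporters[pattern].add(gid)
    return {p for p, gids in supporters.items() if len(gids) >= min_support}

def mine(database, min_support):
    """Return a list whose k-th entry holds the frequent (k+1)-edge patterns."""
    levels, frequent = [], None
    while True:
        pairs = [kv for gid, g in database.items()
                 for kv in map_phase(gid, g, frequent)]
        frequent = reduce_phase(pairs, min_support)
        if not frequent:
            return levels
        levels.append(frequent)

# Toy database of three small graphs (hypothetical data).
AB, BC, CD = edge("A", "B"), edge("B", "C"), edge("C", "D")
DATABASE = {
    "g1": frozenset([AB, BC, CD]),
    "g2": frozenset([AB, BC]),
    "g3": frozenset([BC, CD]),
}
levels = mine(DATABASE, min_support=2)
```

With a minimum support of 2, the toy run yields three frequent 1-edge patterns and two frequent 2-edge patterns, and stops when the only 3-edge candidate is supported by a single graph. In a real MapReduce deployment each loop iteration would correspond to one job, and with repeated vertex labels the subset check would have to be replaced by embedding bookkeeping of the kind the paper describes.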


Notes

In the remainder of this paper, an i-subgraph refers to a subgraph with i edges.
http://dtp.nci.nih.gov/docs/aids/aids_data.html.
http://www.cas.org.
For every \(g \in D\), we maintain all frequent i-subgraphs associated with the corresponding embeddings.
https://github.com/apache/giraph.
http://pubchem.ncbi.nlm.nih.gov.


Acknowledgements

We would like to thank the anonymous reviewers for their helpful and insightful comments. This work was supported in part by the National Natural Science Foundation of China (61502504, 61502347, 61432006), the Natural Science Foundation of Hubei Province of China (2016CFB384), the Ministry of Science and Technology of China, National Key Research and Development Program (Project Number: 2016YFB1000700), the Fundamental Research Funds for the Central Universities, and the Research Funds of Renmin University of China (No. 15XNLF09).

Author information

Authors and Affiliations

School of Information and Key Lab of Data Engineering and Knowledge Engineering, MOE, Renmin University of China, Beijing, China
Zhe Peng, Tongtong Wang, Wei Lu & Xiaoyong Du
State Key Laboratory of Software Engineering, Wuhan University, Wuhan, China
Hao Huang
School of Computing, National University of Singapore, Singapore, Singapore
Feng Zhao & Anthony K. H. Tung


Corresponding author

Correspondence to Wei Lu.


About this article

Cite this article

Peng, Z., Wang, T., Lu, W. et al. Mining frequent subgraphs from tremendous amount of small graphs using MapReduce. Knowl Inf Syst 56, 663–690 (2018). https://doi.org/10.1007/s10115-017-1104-7


Received: 28 July 2016
Revised: 15 May 2017
Accepted: 20 August 2017
Published: 06 October 2017
Issue Date: September 2018
DOI: https://doi.org/10.1007/s10115-017-1104-7
