
A distributed approach for graph mining in massive networks

Published in Data Mining and Knowledge Discovery

Abstract

We propose a novel distributed algorithm for mining frequent subgraphs from a single, very large, labeled network. Our approach is the first distributed method to mine a massive input graph that is too large to fit in the memory of any individual compute node. The input graph therefore has to be partitioned among the nodes, which can lead to false negatives: frequent patterns that are missed because their occurrences span more than one partition. Furthermore, for scalable performance it is crucial to minimize communication among the compute nodes. Our algorithm, DistGraph, ensures that there are no false negatives, and uses a set of optimizations and efficient collective communication operations to minimize information exchange. To our knowledge, DistGraph is the first approach demonstrated to scale to graphs with over a billion vertices and edges. Scalability results on up to 2048 IBM Blue Gene/Q compute nodes, with 16 cores each, show very good speedup.
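The false-negative problem described in the abstract can be illustrated with a toy example. The sketch below is not DistGraph. It is a minimal illustration under stated assumptions: the graph, the vertex partition, the threshold `min_support`, and all function names are hypothetical, and support is simplified to a raw count of single-edge occurrences rather than the support measure used in the paper.

```python
from collections import Counter

# Toy labeled graph (hypothetical data, not from the paper).
labels = {0: "A", 1: "B", 2: "A", 3: "B", 4: "A", 5: "B"}
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

# Hypothetical assignment of vertices to two compute nodes.
partition = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}


def edge_pattern(u, v):
    """Canonical (sorted) label pair identifying a single-edge pattern."""
    return tuple(sorted((labels[u], labels[v])))


def global_counts():
    """Count single-edge pattern occurrences over the whole graph."""
    return Counter(edge_pattern(u, v) for u, v in edges)


def local_counts(part):
    """Count only edges whose two endpoints both live in partition `part`."""
    return Counter(
        edge_pattern(u, v)
        for u, v in edges
        if partition[u] == part and partition[v] == part
    )


def cut_edge_counts():
    """Count edges that cross partitions; seeing them requires communication."""
    return Counter(
        edge_pattern(u, v) for u, v in edges if partition[u] != partition[v]
    )


min_support = 5  # assumed threshold for this illustration
pattern = ("A", "B")

# Each node mines only what it holds; the cut edge (2, 3) is invisible.
naive = local_counts(0) + local_counts(1)
# Accounting for cross-partition occurrences restores the true count.
exact = naive + cut_edge_counts()

print("global count:", global_counts()[pattern])
print("naive local sum:", naive[pattern])
print("with cut edges:", exact[pattern])
print("naive says frequent:", naive[pattern] >= min_support)
print("exact says frequent:", exact[pattern] >= min_support)
```

In this toy setting, the purely local count misses the one occurrence that crosses the partition boundary, so the pattern falls below the threshold and would be reported as infrequent. Any method that guarantees no false negatives has to account for such boundary-spanning occurrences while keeping the resulting data exchange small.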



Acknowledgments

This work was supported by NSF Award IIS-1302231. We thank Chris Carothers and Bulent Yener for several discussions on the practical and theoretical aspects of our distributed algorithm.

Author information

Corresponding author

Correspondence to M. J. Zaki.

Additional information

Responsible editors: Thomas Gärtner, Mirco Nanni, Andrea Passerini and Celine Robardet.

Cite this article

Talukder, N., Zaki, M.J. A distributed approach for graph mining in massive networks. Data Min Knowl Disc 30, 1024–1052 (2016). https://doi.org/10.1007/s10618-016-0466-x

  • DOI: https://doi.org/10.1007/s10618-016-0466-x
