G-Tries: a data structure for storing and finding subgraphs

Data Mining and Knowledge Discovery

Abstract

The ability to find and count subgraphs of a given network is an important, nontrivial task with multidisciplinary applicability. Discovering network motifs and computing graphlet signatures are two examples of methodologies that rely, at their core, on the subgraph counting problem. Here we present the g-trie, a data structure specifically designed for discovering subgraph frequencies. We produce a tree that encapsulates the structure of the entire graph set, taking advantage of common topologies in the same way a prefix tree takes advantage of common prefixes. This avoids redundancy in the representation of the graphs, allowing savings in both memory and computation time. We introduce a specialized canonical labeling designed to highlight common substructures, and we annotate the g-trie with a set of conditional rules that break symmetries, avoiding repetitions in the computation. We introduce a novel algorithm that takes as input a set of small graphs and efficiently finds and counts them as induced subgraphs of a larger network. We perform an extensive empirical evaluation of our algorithms, focusing on efficiency and scalability on a diverse set of complex networks. Results show that g-tries outperform previously existing algorithms by at least one order of magnitude.
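The prefix-tree idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a simplified stand-in that assumes undirected graphs given as adjacency matrices in a fixed vertex order, omits the canonical labeling and the symmetry-breaking conditions, and instead removes repeated matches by remembering the vertex set of each occurrence. All names (`GTrieNode`, `insert`, `count_induced`) are hypothetical.

```python
class GTrieNode:
    """One trie level per vertex; a child key records the new vertex's
    connections (0/1) to the vertices already placed on the path."""

    def __init__(self):
        self.children = {}
        self.name = None  # set when a stored graph ends at this node


def insert(root, name, adj):
    """Store a small graph; graphs with equal leading rows share nodes."""
    node = root
    for k in range(len(adj)):
        key = tuple(adj[k][:k])
        node = node.children.setdefault(key, GTrieNode())
    node.name = name


def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children.values())


def count_induced(root, net):
    """Count induced occurrences of every stored graph in `net`,
    a dict mapping each vertex to the set of its neighbours."""
    found = {}

    def extend(node, mapped):
        if node.name is not None:
            # Assumption of this sketch: deduplicate by vertex set
            # instead of using symmetry-breaking conditions.
            found.setdefault(node.name, set()).add(frozenset(mapped))
        for key, child in node.children.items():
            for v in net:
                if v in mapped:
                    continue
                # Induced match: edges AND non-edges must agree.
                if all((u in net[v]) == bool(bit)
                       for u, bit in zip(mapped, key)):
                    extend(child, mapped + [v])

    extend(root, [])
    return {name: len(occ) for name, occ in found.items()}


root = GTrieNode()
insert(root, "triangle", [[0, 1, 1], [1, 0, 1], [1, 1, 0]])
insert(root, "path", [[0, 1, 1], [1, 0, 0], [1, 0, 0]])  # centre vertex first

# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
network = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
counts = count_induced(root, network)
print(counts["triangle"], counts["path"])  # 1 2
print(count_nodes(root))                   # 5 (two graphs share two levels)
```

Both stored graphs share the first two trie levels, so a partial match in the network is extended once for both of them. In this sketch each occurrence is still reached several times and then collapsed by the vertex-set check, which is the redundant work that the conditional rules described in the abstract are meant to avoid.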

Notes

  1. G-Tries source code is available at: http://www.dcc.fc.up.pt/gtries/

  2. Fanmod is available at http://theinf1.informatik.uni-jena.de/~wernicke/motifs/.

  3. Kavosh source code is available at http://lbb.ut.ac.ir/Download/LBBsoft/Kavosh/.

  4. The source code is available at http://sites.google.com/site/andrealancichinetti/files.

Acknowledgments

We thank the reviewers for their constructive suggestions, which helped improve the quality of this manuscript. Pedro Ribeiro is funded by an FCT Research Grant (SFRH/BPD/81695/2011).

Author information

Corresponding author

Correspondence to Pedro Ribeiro.

Additional information

Responsible editor: M. J. Zaki.

About this article

Cite this article

Ribeiro, P., Silva, F. G-Tries: a data structure for storing and finding subgraphs. Data Min Knowl Disc 28, 337–377 (2014). https://doi.org/10.1007/s10618-013-0303-4

  • DOI: https://doi.org/10.1007/s10618-013-0303-4
