G-Tries: a data structure for storing and finding subgraphs

Ribeiro, Pedro; Silva, Fernando

doi:10.1007/s10618-013-0303-4

Published: 12 February 2013

Data Mining and Knowledge Discovery, Volume 28, pages 337–377 (2014)

Pedro Ribeiro¹ &
Fernando Silva¹

Abstract

The ability to find and count subgraphs of a given network is an important, nontrivial task with multidisciplinary applicability. Discovering network motifs and computing graphlet signatures are two examples of methodologies that rely at their core on the subgraph counting problem. Here we present the g-trie, a data structure specifically designed for discovering subgraph frequencies. We produce a tree that encapsulates the structure of the entire graph set, taking advantage of common topologies in the same way a prefix tree takes advantage of common prefixes. This avoids redundancy in the representation of the graphs, thus allowing for both memory and computation time savings. We introduce a specialized canonical labeling designed to highlight common substructures, and we annotate the g-trie with a set of conditional rules that break symmetries, avoiding repetitions in the computation. We introduce a novel algorithm that takes as input a set of small graphs and efficiently finds and counts them as induced subgraphs of a larger network. We perform an extensive empirical evaluation of our algorithms, focusing on efficiency and scalability on a diversified set of complex networks. Results show that g-tries clearly outperform previously existing algorithms by at least one order of magnitude.
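
The prefix-tree idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `GTrieNode`, `insert`, `count_nodes` and `census` are hypothetical, the vertex order of each small graph is taken as given rather than computed by the specialized canonical labeling, and the symmetry-breaking conditions are replaced by a much cruder device (collecting distinct vertex sets), so repeated matches are discarded after the fact instead of being avoided. Graphs are assumed undirected, and each tree edge is labelled with the connections of a new vertex to the vertices placed before it.

```python
class GTrieNode:
    """One node per vertex position in a stored graph (illustrative only)."""

    def __init__(self):
        self.children = {}      # key: tuple of 0/1 flags, edges to earlier vertices
        self.graph_name = None  # set when a stored graph ends at this node


def insert(root, name, adj):
    """Insert a small graph given as a square 0/1 adjacency matrix.

    Assumption: the vertex order is used as given (no canonical labeling).
    """
    node = root
    for i, row in enumerate(adj):
        key = tuple(row[:i])  # connections of vertex i to vertices 0..i-1
        node = node.children.setdefault(key, GTrieNode())
    node.graph_name = name


def count_nodes(node):
    """Number of nodes in the tree, including the root."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


def census(root, network):
    """Count induced occurrences of every stored graph in `network`.

    `network` maps each vertex to the set of its neighbours. Without
    symmetry-breaking rules the same occurrence is reached several times,
    so distinct vertex sets are collected instead.
    """
    found = {}

    def extend(node, used):
        if node.graph_name is not None:
            found.setdefault(node.graph_name, set()).add(frozenset(used))
        for key, child in node.children.items():
            for v in network:
                if v in used:
                    continue
                # Induced match: edges AND non-edges must agree with the key.
                if all((u in network[v]) == bool(flag)
                       for u, flag in zip(used, key)):
                    extend(child, used + [v])

    extend(root, [])
    return {name: len(vertex_sets) for name, vertex_sets in found.items()}


root = GTrieNode()
insert(root, "triangle", [[0, 1, 1], [1, 0, 1], [1, 1, 0]])
insert(root, "path3", [[0, 1, 0], [1, 0, 1], [0, 1, 0]])

# A triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
network = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}

node_total = count_nodes(root)    # 5: both graphs share their first two vertices
result = census(root, network)    # {"triangle": 1, "path3": 2}
```

In this toy case the two three-vertex graphs share their first two vertex positions, so the tree holds 5 nodes (root included) rather than the 7 that two separate chains would need, and one traversal counts both graphs at once. The triangle is still reached six times during the search before deduplication, which is the kind of repeated work the symmetry-breaking rules described in the abstract are meant to eliminate.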

Notes

G-Tries source code is available at http://www.dcc.fc.up.pt/gtries/.
Fanmod is available at http://theinf1.informatik.uni-jena.de/~wernicke/motifs/.
Kavosh source code is available at http://lbb.ut.ac.ir/Download/LBBsoft/Kavosh/.
The source code is available at http://sites.google.com/site/andrealancichinetti/files.

Acknowledgments

We thank the reviewers for their constructive and helpful suggestions, which helped in improving the quality of this manuscript. Pedro Ribeiro is funded by an FCT Research Grant (SFRH/BPD/81695/2011).

Author information

Authors and Affiliations

CRACS & INESC-TEC, Faculdade de Ciencias, Universidade do Porto, R. Campo Alegre, 1021, 4169-007 Porto, Portugal
Pedro Ribeiro & Fernando Silva

Corresponding author

Correspondence to Pedro Ribeiro.

Additional information

Responsible editor: M. J. Zaki.

About this article

Cite this article

Ribeiro, P., Silva, F. G-Tries: a data structure for storing and finding subgraphs. Data Min Knowl Disc 28, 337–377 (2014). https://doi.org/10.1007/s10618-013-0303-4

Received: 07 May 2012
Accepted: 16 January 2013
Published: 12 February 2013
Issue Date: March 2014
DOI: https://doi.org/10.1007/s10618-013-0303-4
