Scalable subgraph enumeration in MapReduce: a cost-oriented approach

  • Regular Paper
  • Published in The VLDB Journal

Abstract

Subgraph enumeration, which aims to find all the subgraphs of a large data graph that are isomorphic to a given pattern graph, is a fundamental graph problem with a wide range of applications. However, existing sequential algorithms for subgraph enumeration fall short in handling large graphs due to the involvement of computationally intensive subgraph isomorphism operations. Thus, some recent studies focus on solving the problem using MapReduce. Nevertheless, existing MapReduce approaches are not scalable to very large graphs since they either produce a huge number of partial results or consume a large amount of memory. Motivated by this, in this paper, we propose a new algorithm \(\mathsf {Twin}\) \(\mathsf {Twig}\) \(\mathsf {Join}\) based on a left-deep-join framework in MapReduce, in which the basic join unit is a \(\mathsf {Twin}\) \(\mathsf {Twig}\) (an edge or two incident edges of a node). We show that in the Erdős–Rényi random graph model, \(\mathsf {Twin}\) \(\mathsf {Twig}\) \(\mathsf {Join}\) is instance optimal in the left-deep-join framework under reasonable assumptions, and we devise an algorithm to compute the optimal join plan. We further discuss how our approach can be adapted to handle the power-law random graph model. Three optimization strategies are explored to improve our algorithm. Finally, by aggregating equivalent nodes into a compressed node, we construct the compressed graph, upon which subgraph enumeration is further improved. We conduct extensive performance studies on several real graphs, one of which contains billions of edges. Our approach significantly outperforms existing solutions in all tests.


Notes

  1. Note that we apply a threshold \(\sqrt{M}\) in lines 2 and 13; we will discuss this threshold later. For now, assume that there is no threshold.

  2. Lines 26–28 deal with nodes whose degree is larger than the threshold (lines 2 and 13).

  3. The case where p is a single edge is much easier; hence, we focus on the two-edge \(\mathsf {Twin}\) \(\mathsf {Twig}\) in the following.

References

  1. Afrati, F.N., Fotakis, D., Ullman, J.D.: Enumerating subgraph instances using map-reduce. In: Proceedings of the ICDE’13 (2013)

  2. Ahmed, N.K., Neville, J., Rossi, R.A., Duffield, N., Willke, T.L.: Graphlet Decomposition: Framework, Algorithms, and Applications. ArXiv e-prints (2015)

  3. Aiello, W., Chung, F., Lu, L.: A random graph model for massive graphs. In: Proceedings of the STOC’00 (2000)

  4. Alon, N., Dao, P., Hajirasouliha, I., Hormozdiari, F., Sahinalp, S.C.: Biomolecular network motif counting and discovery by color coding. In: Proceedings of the ISMB’08 (2008)

  5. Bhuiyan, M.A., Hasan, M.A.: An iterative mapreduce based frequent subgraph mining algorithm. TKDE 27(3), 608–620 (2015)

  6. Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)

  7. Chiba, N., Nishizeki, T.: Arboricity and subgraph listing algorithms. SIAM J. Comput. 14(1), 210–223 (1985)

  8. Chung, F.R.K., Lu, L., Vu, V.H.: The spectra of random graphs with given expected degrees. Internet Math. 1(3), 6313–6318 (2003)

  9. Clauset, A., Shalizi, C.R., Newman, M.E.J.: Power-law distributions in empirical data. SIAM Rev. 51(4), 661–703 (2009)

  10. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. In: Proceedings of the OSDI’04 (2004)

  11. Erdős, P., Rényi, A.: On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci. 38(4), 343–347 (1960)

  12. Fan, W., Li, J., Ma, S., Tang, N., Wu, Y., Wu, Y.: Graph pattern matching: from intractable to polynomial time. PVLDB 3(1), 264–275 (2010)

  13. Gonen, M., Ron, D., Shavitt, Y.: Counting stars and other small subgraphs in sublinear time. In: Proceedings of the SODA’10 (2010)

  14. Grochow, J.A., Kellis, M.: Network motif discovery using subgraph enumeration and symmetry-breaking. In: Proceedings of the RECOMB’07 (2007)

  15. Gonzalez, J., Low, Y., Gu, H., Bickson, D., Guestrin, C.: Powergraph: distributed graph-parallel computation on natural graphs. In: Proceedings of the OSDI’12 (2012)

  16. Han, W.S., Lee, J., Lee, J.H.: Turboiso: Towards ultrafast and robust subgraph isomorphism search in large graph databases. In: Proceedings of the SIGMOD’13 (2013)

  17. He, H., Singh, A.K.: Graphs-at-a-time: query language and access methods for graph databases. In: Proceedings of the SIGMOD’08 (2008)

  18. Kairam, S.R., Wang, D.J., Leskovec, J.: The life and death of online groups: predicting group growth and longevity. In: Proceedings of the WSDM’12 (2012)

  19. Khan, A., Wu, Y., Aggarwal, C.C., Yan, X.: Nema: Fast graph search with label similarity. PVLDB 6(3), 181–190 (2013)

  20. Lai, L., Qin, L., Lin, X., Chang, L.: Scalable subgraph enumeration in mapreduce. PVLDB 8(10), 974–985 (2015)

  21. Lee, J., Han, W.S., Kasperovics, R., Lee, J.H.: An in-depth comparison of subgraph isomorphism algorithms in graph databases. PVLDB 6(2), 133–144 (2012)

  22. Leskovec, J., Singh, A., Kleinberg, J.: Patterns of influence in a recommendation network. In: Proceedings of the PAKDD’06 (2006)

  23. Lin, W., Xiao, X., Gabriel, G.: Large-scale frequent subgraph mining in mapreduce. In: ICDE, pp. 844–855 (2014)

  24. Ma, S., Cao, Y., Huai, J., Wo, T.: Distributed graph pattern matching. In: WWW (2012)

  25. Milenkovic, T., Przulj, N.: Uncovering biological network function via graphlet degree signatures. Cancer Inf. 6, 257–273 (2008)

  26. Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., Alon, U.: Network motifs: simple building blocks of complex networks. Science 298(5594), 824–827 (2002)

  27. Plantenga, T.: Inexact subgraph isomorphism in mapreduce. J. Parallel Distrib. Comput. 73(2), 164–175 (2013)

  28. Przulj, N.: Biological network comparison using graphlet degree distribution. Bioinformatics 23(2), 177–183 (2007)

  29. Rahman, M., Bhuiyan, M.A., Hasan, M.A.: Graft: an efficient graphlet counting method for large graph analysis. TKDE 26(10), 2466–2478 (2014)

  30. Ren, X., Wang, J.: Exploiting vertex relationships in speeding up subgraph isomorphism over large graphs. PVLDB 8(5), 617–628 (2015)

  31. Rücker, G., Rücker, C.: Substructure, subgraph, and walk counts as measures of the complexity of graphs and molecules. J. Chem. Info. Comput. Sci. 41(6), 1457–1462 (2001)

  32. Shervashidze, N., Vishwanathan, S., Petri, T., Mehlhorn, K., Borgwardt, K.: Efficient graphlet kernels for large graph comparison. In: AISTATS (2009)

  33. Shang, H., Zhang, Y., Lin, X., Yu, J.X.: Taming verification hardness: an efficient algorithm for testing subgraph isomorphism. PVLDB 1(1), 364–375 (2008)

  34. Steinbrunn, M., Moerkotte, G., Kemper, A.: Optimizing Join Orders. Tech. rep. (1993)

  35. Sun, Z., Wang, H., Wang, H., Shao, B., Li, J.: Efficient subgraph matching on billion node graphs. PVLDB 5(9), 788–789 (2012)

  36. Suri, S., Vassilvitskii, S.: Counting triangles and the curse of the last reducer. In: Proceedings of the WWW’11 (2011)

  37. Tsourakakis, C.E., Kang, U., Miller, G.L., Faloutsos, C.: Doulion: Counting triangles in massive graphs with a coin. In: Proceedings of the KDD’09 (2009)

  38. Viger, F., Latapy, M.: Efficient and simple generation of random simple connected graphs with prescribed degree sequence. In: COCOON’05, pp. 440–449 (2005)

  39. Wang, J., Cheng, J.: Truss decomposition in massive networks. PVLDB 5(9), 812–823 (2012)

  40. Watts, D., Strogatz, S.: Collective dynamics of ’small-world’ networks. Nature 6684(393), 440–442 (1998)

  41. Zhao, P., Han, J.: On graph query optimization in large networks. PVLDB 3(1–2), 340–451 (2010)

  42. Zhao, Z., Khan, M., Kumar, V.S.A., Marathe, M.V.: Subgraph enumeration in large social contact networks using parallel color coding and streaming. In: Proceedings of the ICPP’10 (2010)

Acknowledgements

Funding was provided by Australian Research Council (Grant Nos. DE140100999, DP120104168, DE150100563) and National Natural Science Foundation of China (Grant No. 61232006).

Author information

Corresponding author

Correspondence to Longbin Lai.

Appendix

1.1 Proofs in Sect. 5

Proof of Lemma 2 (Sect. 5.2)

Suppose \(P_i\) contains \(n_i\) nodes and \(m_i\) edges. Then \(|R(P_{i-1})|=\frac{(2M)^{m_{i-1}}}{N^{2m_{i-1}-n_{i-1}}}\) and \(|R(P_i)|=\frac{(2M)^{m_i}}{N^{2m_i-n_i}}\). Letting \({\varDelta }m_i=m_i-m_{i-1}\) and \({\varDelta }n_i=n_i-n_{i-1}\), we have

$$\begin{aligned} |R(P_i)|=|R(P_{i-1})|\times \left( \frac{2M}{N^2}\right) ^{{\varDelta }m_i}\times N^{{\varDelta }n_i}. \end{aligned}$$
(8)

Since \(\mathcal{D}\) is a strong \(\mathsf {Twin}\) \(\mathsf {Twig}\) decomposition, there are three cases for \(p_i\) (\(1\le i \le t\)):

  • (\(|E(p_i)|=1\) and \(|V(p_i)\cap V(P_{i-1})|=2\)): In this case, \({\varDelta }m_i=1\) and \({\varDelta }n_i=0\). It follows that

    $$\begin{aligned} |R(P_i)|=|R(P_{i-1})|\times \frac{2M}{N^2} < |R(P_{i-1})|\times \frac{(2M)^2}{N^3}. \end{aligned}$$
  • (\(|E(p_i)|=2\) and \(|V(p_i)\cap V(P_{i-1})|=2\)): In this case, \({\varDelta }m_i=2\) and \({\varDelta }n_i=1\). It follows that

    $$\begin{aligned} |R(P_i)|&=|R(P_{i-1})|\times \left( \frac{2M}{N^2}\right) ^2\times N \\&=|R(P_{i-1})|\times \frac{(2M)^2}{N^3}. \end{aligned}$$
  • (\(|E(p_i)|=2\) and \(|V(p_i)\cap V(P_{i-1})|=3\)): In this case, \({\varDelta }m_i=2\) and \({\varDelta }n_i=0\). It follows that

    $$\begin{aligned} |R(P_i)|=|R(P_{i-1})|\times \left( \frac{2M}{N^2}\right) ^2 < |R(P_{i-1})|\times \frac{(2M)^2}{N^3}. \end{aligned}$$

In all the above three cases, we have \(|R(P_i)| \le |R(P_{i-1})| \times \frac{(2M)^2}{N^3}\). As a result, \(|R(P_i)| \le |R(P_{i-1})| \times \frac{(2M)^2}{N^3} \le |R(P_{i-2})| \times (\frac{(2M)^2}{N^3})^2 \le \cdots \le |R(p_0)|\times (\frac{(2M)^2}{N^3})^i\). \(\square \)
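The recurrence in Eq. 8 and the per-step bound of Lemma 2 can be checked numerically. The following Python sketch is illustrative only (not the paper's implementation; the function names are ours) and evaluates the expected result sizes in the Erdős–Rényi model:

```python
# Illustrative sketch (not the paper's code): expected partial-result sizes
# in the Erdos-Renyi model, following Eq. 8 and the bound of Lemma 2.
import math

def expected_result_size(m, n, M, N):
    """|R(P)| = (2M)^m / N^(2m - n) for a pattern with m edges and n nodes."""
    return (2 * M) ** m / N ** (2 * m - n)

def grow(prev_size, delta_m, delta_n, M, N):
    """Eq. 8: |R(P_i)| = |R(P_{i-1})| * (2M/N^2)^dm * N^dn."""
    return prev_size * (2 * M / N ** 2) ** delta_m * N ** delta_n

# Example: N = 1e6 nodes, M = 1e7 edges, so d = 2M/N = 20 < sqrt(N) (A3).
N, M = 10 ** 6, 10 ** 7
decay = (2 * M) ** 2 / N ** 3          # the per-step factor of Lemma 2
assert decay < 1

# Joining a strong TwinTwig that adds 2 edges and 1 node shrinks the result:
r0 = expected_result_size(2, 3, M, N)  # a two-edge TwinTwig
r1 = grow(r0, 2, 1, M, N)
assert math.isclose(r1, r0 * decay)
```

Each left-deep join step thus multiplies the expected size by at most \(\frac{(2M)^2}{N^3}\), which is exactly the geometric decay used in the proof.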

Proof of Corollary 1 (Sect. 5.2)

By assumption \(A_3\) (\(d=2M/N<\sqrt{N}\)), we know that \(\frac{(2M)^2}{N^3} = \frac{d^2}{N} < 1\). Corollary 1 then follows immediately from Lemma 2. \(\square \)

Proof of Theorem 1 (Sect. 5.2)

For any pattern decomposition \(\mathcal{D}\), we divide \({\mathsf {cost}} (\mathcal{D}) = 3 {\varSigma }_{i=1}^t |R(P_i)| + {\varSigma }_{i=0}^t |R(p_i)| + t |E(G)|\) (Eq. 1) into two parts:

  • \({\mathsf {cost}} _1(\mathcal{D}) = {\varSigma }_{i=0}^t |R(p_i)| + t |E(G)|\).

  • \({\mathsf {cost}} _2(\mathcal{D}) = 3 {\varSigma }_{i=1}^t |R(P_i)|\).

Accordingly, we divide the proof into two parts:

(Part 1): We prove \({\mathsf {cost}} _1(\mathcal{D}) \le {\varTheta }({\mathsf {cost}} _1(\mathcal{D'}))\). We only need to prove \({\mathsf {cost}} _1(\mathcal{D}^i)\le {\varTheta }({\mathsf {cost}} _1(\{p'_i\}))\) for each \(0\le i \le t'\). Note that when \(|E(p'_i)|\le 2\), \({\mathsf {cost}} _1(\mathcal{D}^i)={\mathsf {cost}} _1(\{p'_i\})\), thus, we only consider \(|E(p'_i)|\ge 3\). In this case, we have:

  • \({\mathsf {cost}} _1(\mathcal{D}^i)\le {\varTheta }(t'_i\cdot d^2\cdot N)\). According to Lemma 1, we know that each pattern \(p_j^i \in D^i\) is a \(\mathsf {Twin}\) \(\mathsf {Twig}\) with \(|R(p_j^i)| \le \frac{(2M)^2}{N} = {\varTheta }(d^2 \cdot N)\). Hence, we have

    $$\begin{aligned} {\mathsf {cost}} _1(\mathcal{D}^i) = {\varSigma }_{j=1}^{\lceil t'_i/2 \rceil }\left( |R(p_j^i)| + |E(G)|\right) \le {\varTheta }\left( t'_i \cdot d^2 \cdot N\right) . \end{aligned}$$
  • \({\mathsf {cost}} _1(\{p'_i\}) \ge {\varTheta }(t'_i\cdot d^3\cdot N)\). This is because

    $$\begin{aligned} \begin{aligned} {\mathsf {cost}} _1(\{p'_i\})&\ge |R(p'_i)|=d^{t'_i}\times N \ge (t'_i-2)\times d^3\times N \\&\ge t'_i/3\times d^3\times N \;\;\; \left( \text {by }t'_i=|E(p'_i)|\ge 3\right) \\&= {\varTheta }\left( t'_i\cdot d^3\cdot N\right) . \end{aligned} \end{aligned}$$

Thus, \({\mathsf {cost}} _1(\mathcal{D}^i) \le {\varTheta }({\mathsf {cost}} _1(\{p'_i\}))\).

(Part 2): We prove \({\mathsf {cost}} _2(\mathcal{D}) = {\varTheta }({\mathsf {cost}} _2(\mathcal{D'}))\). We reformulate \({\mathsf {cost}} _2(\mathcal{D'})\) as \(3(-\frac{|R(P'_0)|}{2} + \frac{{\varSigma }_{i=1}^{t'}(|R(P'_{i-1})|+|R(P'_i)|)}{2} + \frac{|R(P'_{t'})|}{2})\). Thus,

$$\begin{aligned} {\mathsf {cost}} _2(\mathcal{D}')={\varTheta }\left( {\varSigma }_{i=1}^{t'} \left( |R(P'_{i-1})|+|R(P'_i)|\right) \right) . \end{aligned}$$
(9)

Note that in \(\mathcal{D}\), which is constructed from \(\mathcal{D}'\), we gradually join \(p^i_1, p^i_2, \ldots , p^i_{t_i}\) with \(P'_{i-1}\) to obtain \(P'_i\). Hence, the term \(|R(P'_{i-1})|+|R(P'_i)|\) for each \(1\le i \le t'\) in \({\mathsf {cost}} _2(\mathcal{D}')\) is replaced by

$$\begin{aligned} \begin{aligned} {\mathsf {cost}} ^i_2 (\mathcal{D})=&|R \left( P'_{i-1}\right) |+|R\left( P'_{i-1} \cup p^i_1\right) | \\&+ \cdots + |R\left( P'_{i-1} \cup p^i_1 \cup \cdots \cup p^i_{t_i-1}\right) | \\&+ |R \left( P'_i\right) |. \end{aligned} \end{aligned}$$
(10)

Recall that there exists a \(k_i\) such that, when \(1 \le j \le k_i\), \(p^i_j\) is a strong \(\mathsf {Twin}\) \(\mathsf {Twig}\), and when \(k_i < j \le t_i\), \(p^i_j\) is a non-strong \(\mathsf {Twin}\) \(\mathsf {Twig}\). Let \(x = k_i\) and \(y = t_i - k_i\), then there are \(x+y+1\) terms in \({\mathsf {cost}} ^i_2(\mathcal{D})\). We have,

  • (\(S_1\)): The sum of the first \(x+1\) terms in \({\mathsf {cost}} ^i_2(\mathcal{D})\) is \({\varTheta }(|R(P'_{i-1})|)\). Since each \(p^i_j\) is a strong \(\mathsf {Twin}\) \(\mathsf {Twig}\), according to Lemma 2 and Corollary 1, when j increases, the size of the j-th term decreases exponentially with a rate \(\le \frac{(2M)^2}{N^3}<1\), thus, statement \(S_1\) holds.

  • (\(S_2\)): The sum of the last y terms in \({\mathsf {cost}} ^i_2(\mathcal{D})\) is \({\varTheta }(|R(P'_i)|)\). Since each \(p^i_j\) is a non-strong \(\mathsf {Twin}\) \(\mathsf {Twig}\), according to Eq. 8, when j increases, the size of the j-th term increases exponentially with a rate \(\ge d>1\), thus, statement \(S_2\) holds.

Based on \(S_1\) and \(S_2\), we have \({\mathsf {cost}} _2(\mathcal{D}) = {\varTheta }({\mathsf {cost}} _2(\mathcal{D'}))\), and therefore, Theorem 1 holds. \(\square \)
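The split of Eq. 1 used throughout this proof can also be stated compactly in code. The sketch below is our own illustration (`unit_sizes` and `partial_sizes` are assumed to hold the expected cardinalities) and mirrors the \({\mathsf {cost}} _1\)/\({\mathsf {cost}} _2\) decomposition:

```python
# Illustrative helper (not the paper's code): the cost of a left-deep join
# plan from Eq. 1, split into cost_1 and cost_2 exactly as in the proof.

def cost1(unit_sizes, E):
    """unit_sizes = [|R(p_0)|, ..., |R(p_t)|]; E = |E(G)|."""
    t = len(unit_sizes) - 1
    return sum(unit_sizes) + t * E

def cost2(partial_sizes):
    """partial_sizes = [|R(P_1)|, ..., |R(P_t)|]."""
    return 3 * sum(partial_sizes)

def total_cost(unit_sizes, partial_sizes, E):
    return cost1(unit_sizes, E) + cost2(partial_sizes)

# Toy numbers: t = 2 joins, |E(G)| = 100.
assert total_cost([10, 20, 30], [200, 400], 100) == 60 + 200 + 1800
```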

Proof of Lemma 3 (Sect. 5.3)

We first prove the space complexity. Each entry \((P',\mathcal{D'}, {\underline{{\mathsf {cost}}}}(\mathcal{D'}, P))\) in \(\mathcal{H}\) is uniquely identified by the partial pattern \(P'\), and there are at most \(2^m\) partial patterns, which consume at most \(O(2^m)\) space. Note that each \(P'\) and \(\mathcal{D}'\) can be stored in constant space by keeping only the last \(\mathsf {Twin}\) \(\mathsf {Twig}\) p that generates \(P'\) and \(\mathcal{D}'\), together with a link to the entry identified by \(P'-p\).

Next we prove the time complexity. Let s be the number of possible \(\mathsf {Twin}\) \(\mathsf {Twig}\)s in P; we have

$$\begin{aligned} s = {\varSigma }_{v\in V(P)} d(v)^2 \le {\varSigma }_{v\in V(P)} d(v) \times {\overline{d}} = 2m \times {\overline{d}}. \end{aligned}$$

When an entry is popped from \(\mathcal{H}\), it can be expanded at most s times. Using a Fibonacci heap, pop takes \(O(\log (|\mathcal{H}|))\) amortized time, while update and push take O(1) amortized time. Thus, the overall time complexity is

$$\begin{aligned} O(2^m \cdot (s + \log (|\mathcal{H}|))) = O({\overline{d}} \cdot m \cdot 2^m). \end{aligned}$$

\(\square \)
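To make the search analyzed in Lemma 3 concrete, here is a hedged Python sketch of a best-first search over partial patterns. It is a simplification, not the paper's algorithm: Python's `heapq` replaces the Fibonacci heap (stale entries are skipped lazily instead of using decrease-key), and the incremental cost function `unit_cost` is assumed to be supplied by the cost model.

```python
# Hedged sketch of a best-first search for a cheap left-deep TwinTwig plan,
# mirroring the structure analyzed in Lemma 3. heapq + lazy deletion stands
# in for the Fibonacci heap; `unit_cost` (the incremental cost of joining
# one unit) is assumed to be given.
import heapq
from itertools import combinations, count

def twin_twigs(pattern_edges):
    """All TwinTwigs: a single edge, or two edges sharing a node."""
    units = [frozenset([e]) for e in pattern_edges]
    for e1, e2 in combinations(pattern_edges, 2):
        if set(e1) & set(e2):
            units.append(frozenset([e1, e2]))
    return units

def _touches(unit, covered):
    nodes = {v for e in covered for v in e}
    return any(v in nodes for e in unit for v in e)

def best_plan(pattern_edges, unit_cost):
    all_edges = frozenset(pattern_edges)
    units = twin_twigs(pattern_edges)
    tie = count()                       # tie-breaker for heap comparisons
    best = {frozenset(): 0.0}           # cheapest known cost per partial pattern
    heap = [(0.0, next(tie), frozenset(), [])]
    while heap:
        c, _, covered, plan = heapq.heappop(heap)
        if c > best.get(covered, float("inf")):
            continue                    # stale entry (lazy decrease-key)
        if covered == all_edges:
            return c, plan
        for u in units:
            if u <= covered:
                continue
            if covered and not _touches(u, covered):
                continue                # a left-deep step must stay connected
            nxt, nc = covered | u, c + unit_cost(covered, u)
            if nc < best.get(nxt, float("inf")):
                best[nxt] = nc
                heapq.heappush(heap, (nc, next(tie), nxt, plan + [u]))
    return None
```

For a triangle pattern with the toy cost `unit_cost = lambda covered, u: len(u)`, the search returns a plan of total cost 3, e.g. one two-edge \(\mathsf {Twin}\) \(\mathsf {Twig}\) followed by the closing edge.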

1.2 Instance optimality of \(\mathsf {Twin}\) \(\mathsf {Twig}\) \(\mathsf {Join}\) in the power-law random graph (Sect. 6)

To show the instance optimality of \(\mathsf {Twin}\) \(\mathsf {Twig}\) \(\mathsf {Join}\) in power-law graphs, we prove that Theorem 1 holds in a power-law random graph model. Following the same proof structure as Theorem 1, we divide the proof into two parts: in part 1, we prove that \({\mathsf {cost}} _1(\mathcal{D})\le {\varTheta }({\mathsf {cost}} _1(\mathcal{D}'))\), and in part 2, we prove that \({\mathsf {cost}} _2(\mathcal{D})={\varTheta }({\mathsf {cost}} _2(\mathcal{D}'))\). To prove part 2, we again compare Eqs. 9 and 10, and then prove the two statements, namely \(S_1\): the size of the results decreases after joining a strong \(\mathsf {Twin}\) \(\mathsf {Twig}\), and \(S_2\): the size of the results increases after joining a non-strong \(\mathsf {Twin}\) \(\mathsf {Twig}\). Below is the detailed proof.

(Part 1): Let p be a two-edge \(\mathsf {Twin}\) \(\mathsf {Twig}\);Footnote 3 we have

$$\begin{aligned} \begin{aligned}&{\mathsf {cost}} _1(\mathcal{D}^i)={\varTheta }(|R(p)|\cdot t_i') \text { and, }\\&{\mathsf {cost}} _1(\{p_i'\})={\varTheta }(|R(p)|\cdot {\mathbb {E}}[d(u)^{t_i'-2}]) \\&\quad \ge {\varTheta }(|R(p)|\cdot {\mathbb {E}}[d(u)]^{t_i'-2})={\varTheta }(|R(p)|\cdot d^{t_i'-2}). \end{aligned} \end{aligned}$$

where \({\mathbb {E}}[d(u)]\) is the expected degree for an arbitrary node u in V(G). Given that \(d\ge 2\) and \(t_i'\ge 3\), it is easy to see that \({\mathsf {cost}} _1(\mathcal{D}^i) \le {\mathsf {cost}} _1(\{p_i'\})\) for each \(0\le i \le t'\), which results in \({\mathsf {cost}} _1(\mathcal{D})\le {\varTheta }({\mathsf {cost}} _1(\mathcal{D}'))\). Therefore, part 1 is proved.

Fig. 15: Values of \(\gamma \) in different parameter combinations. a Vary N: \(d = 5\). b Vary N: \(d = 10\). c Vary N: \(d = 100\). d Vary N: \(d = 500\)

(Part 2): For a certain pattern decomposition, we consider generating \(R(P_i)\) using \(R(P_{i-1})\) and \(R(p_i)\). Let \(\gamma \) be the expected number of matches in \(R(P_i)\) that are generated from a given match in \(R(P_{i-1})\); then we have

$$\begin{aligned} |R(P_i)|=\gamma |R(P_{i-1})|. \end{aligned}$$
(11)

The value of \(\gamma \) depends on how \(p_i\) is joined with \(P_{i-1}\). Suppose \(p_i=\{(v,v'),(v,v'')\}\). To prove part 2, we need to prove the following \(S_1\) and \(S_2\) accordingly.

(\(S_1\)): We prove that \(\gamma <1\) when \(p_i\) is a strong \(\mathsf {Twin}\) \(\mathsf {Twig}\) with \(v'\in V(P_{i-1})\) and \(v''\in V(P_{i-1})\). When \(v\in V(P_{i-1})\), \(\gamma <1\) can be easily proved since no new node is added into \(V(P_i)\). When \(v\notin V(P_{i-1})\), let \(u'\) and \(u''\) be arbitrary matches of \(v'\) and \(v''\), respectively; we have

$$\begin{aligned} \begin{aligned} \gamma&={\mathbb {E}}[{\varSigma }_{u\in V(G)} d(u')d(u)\rho \times d(u'')d(u)\rho ] \\&= {\mathbb {E}}[d(u')d(u'')]\times \rho ^2{\varSigma }_{i=1}^Nw_i^2. \end{aligned} \end{aligned}$$

In order to calculate \(\gamma \), we simplify the calculation of \({\mathbb {E}}[d(u')d(u'')]\) by only considering the relationship between \(u'\) and \(u''\). There are two cases:

First, there is no edge between \(v'\) and \(v''\) in \(P_{i-1}\), and we consider that their matches, \(u'\) and \(u''\), are independent. In this case, \({\mathbb {E}}[d(u')d(u'')]={\mathbb {E}}[d(u')]{\mathbb {E}}[d(u'')]=d^2\). We have

$$\begin{aligned} \gamma =d^2\times \rho ^2{\varSigma }_{i=1}^Nw_i^2=\frac{{\varSigma }_{i=1}^Nw_i^2}{N^2}. \end{aligned}$$
(12)

According to \(A_4\), \(w_i\le d_\mathrm{max}\le \sqrt{N}\), therefore, \(\gamma <\frac{d_\mathrm{max}^2}{N}\le 1\).

Second, there is an edge between \(v'\) and \(v''\) in \(P_{i-1}\). In this case, \(u'\) and \(u''\) must form an edge in the data graph. Using Bayes' rule, we derive:

$$\begin{aligned} \begin{aligned}&P(u'=u_i, u''=u_j|u',u'' \text { form an edge}) \\&\quad = \frac{P(u',u'' \text { form an edge } |u'=u_i,u''=u_j)\times P(u'=u_i, u''=u_j)}{P(u',u'' \text { form an edge})}\\&\quad =\frac{P_{i,j}\times (1/N^2)}{2M/N^2}=\rho P_{i,j}. \end{aligned} \end{aligned}$$

As a result, we have

$$\begin{aligned} \begin{aligned} {\mathbb {E}}[d(u')d(u'')]&= {\varSigma }_{i,j=1}^N \rho P_{i,j} w_iw_j \\&=\rho ^2 \left( {\varSigma }_{i=1}^Nw_i^2{\varSigma }_{j=1}^Nw_j^2\right) =\rho ^2 \left( {\varSigma }_{i=1}^Nw_i^2\right) ^2. \end{aligned} \end{aligned}$$

Therefore, \(\gamma \) can be calculated as

$$\begin{aligned} \gamma =\rho ^2\left( {\varSigma }_{i=1}^Nw_i^2\right) ^2\times \rho ^2{\varSigma }_{i=1}^Nw_i^2 =\frac{\left( {\varSigma }_{i=1}^Nw_i^2\right) ^3}{\left( {\varSigma }_{i=1}^Nw_i\right) ^4}. \end{aligned}$$
(13)

It is hard to compute an upper bound for \(\gamma \) in this case. However, we show that \(\gamma <1\) for most real-world graphs. In order to do so, we vary \(\beta \) from 2.1 to 2.9, d from 5 to 500, and N from 10,000 to 100,000,000. Since \(\gamma \) increases with \(d_\mathrm{max}\), we set \(d_\mathrm{max}=\sqrt{N}\). With \(\beta \), d, N, and \(d_\mathrm{max}\), we can generate \(w_i (1\le i \le N)\) via [38], and thus, \(\gamma \) can be calculated via Eq. 13. The results are shown in Fig. 15, in which we can see that \(\gamma <1\) for all practical cases.
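The quantities in Eqs. 12 and 13 are straightforward to evaluate for a concrete weight sequence. The sketch below uses a simple synthetic power-law sequence as a stand-in for the generator of [38] (so the exact numbers differ from those in Fig. 15), with \(d_\mathrm{max}=\sqrt{N}\) as in assumption \(A_4\):

```python
# Hedged sketch: evaluating gamma from Eqs. 12 and 13 for a synthetic
# power-law weight sequence (a simple stand-in for the generator of [38];
# exact values therefore differ from Fig. 15). d_max = sqrt(N) as in A4.
import math

def powerlaw_weights(N, beta, d, d_max):
    # raw power-law shape w_i ~ i^{-1/(beta-1)}, rescaled to average weight d
    raw = [(i + 1) ** (-1.0 / (beta - 1)) for i in range(N)]
    scale = d * N / sum(raw)
    return [min(w * scale, d_max) for w in raw]

def gamma_no_edge(w):    # Eq. 12: (sum w_i^2) / N^2
    return sum(x * x for x in w) / len(w) ** 2

def gamma_with_edge(w):  # Eq. 13: (sum w_i^2)^3 / (sum w_i)^4
    s1, s2 = sum(w), sum(x * x for x in w)
    return s2 ** 3 / s1 ** 4

N, beta, d = 10_000, 2.5, 5
w = powerlaw_weights(N, beta, d, math.sqrt(N))
assert gamma_no_edge(w) < 1 and gamma_with_edge(w) < 1
```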

(\(S_2\)): We prove that \(\gamma >1\) when \(p_i\) is a non-strong \(\mathsf {Twin}\) \(\mathsf {Twig}\) with \(v\in V(P_{i-1})\), \(v'\notin V(P_{i-1})\), and \(v''\notin V(P_{i-1})\). In this situation, we have

$$\begin{aligned} \begin{aligned} \gamma&={\mathbb {E}}[{\varSigma }_{u',u''\in V(G)}d(u)d(u')\rho \times d(u)d(u'')\rho ] \\&={\mathbb {E}}[d(u)^2]\rho ^2{\varSigma }_{i,j=1}^Nw_iw_j={\mathbb {E}}[d(u)^2] = {\varSigma }_{i=1}^N w_i^2/N. \end{aligned} \end{aligned}$$
(14)

Obviously, \(\gamma \ge {\mathbb {E}}[d(u)]^2=d^2>1\). Now according to \(S_1\) and \(S_2\), part 2 is proved when \(p_i\) is a two-edge \(\mathsf {Twin}\) \(\mathsf {Twig}\).

According to Part 1 and Part 2, the instance optimality of the \(\mathsf {Twin}\) \(\mathsf {Twig}\) \(\mathsf {Join}\) holds for a power-law random graph.

1.3 Proofs in Sect. 8

Proof of Proposition 1 (Sect. 8.1)

(i) clearly holds, and the proof of (iii) is similar to that of (ii); hence, we concentrate on (ii) here.

(If) Let \(S_1\) be the set of nodes aggregated on \({\mathcal {N}}[u]\) in \({\mathsf {reduce}} ^1\) (Algorithm 3) and \(S_2\) be the set of nodes aggregated on \({\mathcal {N}}(u)\) in \({\mathsf {reduce}} ^2\) (Algorithm 3). If \({\mathcal {S}}(u)\) is a clique compressed node, we show that (1) \({\mathcal {S}}(u) = S_1\), and (2) \(S_2 = \{u\}\).

(1) On the one hand, \(\forall u' \in {\mathcal {S}}(u)\) with \(u' \ne u\), we know \({\mathcal {N}}[u'] = {\mathcal {N}}[u]\), and \(u'\) must be aggregated in \({\mathsf {reduce}} ^1\) (Algorithm 3) on the key \({\mathcal {N}}[u]\). Thus, \(u' \in S_1\), and as a result, \({\mathcal {S}}(u) \subseteq S_1\). On the other hand, \(\forall u' \in S_1\) with \(u' \ne u\), we have \({\mathcal {N}}[u] = {\mathcal {N}}[u']\), leading to \({\mathcal {N}}(u') \setminus \{u\} = {\mathcal {N}}(u) \setminus \{u'\}\). According to Definition 13 and Definition 14, we have \(u' \in {\mathcal {S}}(u)\). As a result, \(S_1 \subseteq {\mathcal {S}}(u)\). Consequently, \({\mathcal {S}}(u) = S_1\) holds. Note that we only output the record in line 10 (Algorithm 3) for \(u_{s_1}\), the minimum node in \(S_1\) (also the representative node of \({\mathcal {S}}(u)\)). Therefore, we have \({\mathsf {out}} ^1(u)\) as shown in 2.

(2) It suffices to show that \(\not \exists u' \ne u\) such that \({\mathcal {N}}(u') = {\mathcal {N}}(u)\). We prove this by contradiction. Suppose there is such a \(u'\). By \({\mathcal {N}}(u') = {\mathcal {N}}(u)\), we must have \(u' \not \in {\mathcal {N}}(u)\). As \({\mathcal {S}}(u)\) is a clique compressed node, \(\exists u'' \ne u'\) and \(u'' \ne u\) such that \({\mathcal {N}}[u''] = {\mathcal {N}}[u]\). We hence have \(u'' \in {\mathcal {N}}(u) \Rightarrow u'' \in {\mathcal {N}}(u') \Rightarrow u' \in {\mathcal {N}}(u'') \Rightarrow u' \in {\mathcal {N}}(u)\), which is a contradiction. As a result, no node other than u itself is gathered in \({\mathsf {reduce}} ^2\) (Algorithm 3), and we have \({\mathsf {out}} ^2\) as shown in 2.

(Only If) When \(u = r_{{\mathcal {S}}(u)}\) and \({\mathsf {out}} ^1(u) = (u; ({\mathsf {\boxtimes }}, {\mathcal {S}}(u)))\), we show that \({\mathcal {S}}(u)\) is a clique compressed node; otherwise, \({\mathsf {out}} ^1(u) = \emptyset \). Clearly, u does not belong to a trivial compressed node, as otherwise case 1 would be expected. Additionally, \({\mathcal {S}}(u)\) cannot be an independent compressed node, as \({\mathsf {out}} ^2(u)\) would never be associated with a “\({\mathsf {\times }}\)” in that case. Therefore, \({\mathcal {S}}(u)\) must be a clique compressed node. \(\square \)
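The grouping idea behind Algorithm 3 can be illustrated on a single machine: nodes sharing a closed neighborhood form a clique compressed node, and nodes sharing an open neighborhood form an independent compressed node. Below is a hedged sketch, our own simplification of the three MapReduce stages into one in-memory pass:

```python
# Hedged single-machine sketch of the grouping idea behind Algorithm 3:
# nodes sharing a closed neighborhood N[u] form a clique compressed node,
# nodes sharing an open neighborhood N(u) form an independent compressed node.
from collections import defaultdict

def compressed_nodes(adj):
    """adj: dict node -> set of neighbors (undirected, no self-loops)."""
    by_closed, by_open = defaultdict(set), defaultdict(set)
    for u, nbrs in adj.items():
        by_closed[frozenset(nbrs | {u})].add(u)   # key N[u] -> clique groups
        by_open[frozenset(nbrs)].add(u)           # key N(u) -> independent groups
    groups, placed = [], set()
    for members in by_closed.values():
        if len(members) > 1:
            groups.append(("clique", frozenset(members)))
            placed |= members
    for members in by_open.values():
        members = members - placed
        if len(members) > 1:
            groups.append(("independent", frozenset(members)))
            placed |= members
    # remaining nodes are trivial compressed nodes
    groups += [("trivial", frozenset([u])) for u in adj if u not in placed]
    return groups
```

For a triangle, all three nodes share the same closed neighborhood and collapse into one clique compressed node; for a star, the leaves share the same open neighborhood and collapse into one independent compressed node.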

Proof of Lemma 4 (Sect. 8.1)

It is clear that each trivial compressed node \({\mathcal {S}}= \{u\}\) will be output in \({\mathsf {reduce}} ^3\) (Algorithm 3) on the key u. Consider a non-trivial compressed node \({\mathcal {S}}= \{u_{s_1}, u_{s_2}, \ldots , u_{s_k}\}\). According to Proposition 1, \({\mathsf {reduce}} ^3\) (Algorithm 3) will receive two values only on the key \(u_{s_1}\), where the compressed node \({\mathcal {S}}\) is generated with \(u_{s_1}\) as the representative node. Therefore, the lemma holds. \(\square \)

Proof of Lemma 5 (Sect. 8.1)

Given any compressed edge \(({\mathcal {S}}, {\mathcal {S}}') \in E(G^*)\), we show that it is returned by Algorithm 4. Let \(u = r_{{\mathcal {S}}}\). On the one hand, \({\mathsf {map}} ^2_2\) (Algorithm 4) outputs \((u; ({\mathsf {\in }}, {\mathcal {S}}))\). On the other hand, we have \(u \in {\mathcal {N}}^o({\mathcal {S}}')\) when \(({\mathcal {S}}, {\mathcal {S}}') \in E(G^*)\). As a result, \({\mathsf {map}} ^2_1\) (Algorithm 4) includes \((u; ({\mathsf {\rightarrow }}, {\mathcal {S}}'))\) in its output. Finally, the above two key-value pairs arrive at \({\mathsf {reduce}} ^2\) (Algorithm 4), and the corresponding compressed edge is formed. \(\square \)

Proof of Lemma 6 (Sect. 8.1)

In MapReduce, communication cost is triggered by transferring the output data of each mapper to the reducer. In Algorithm 3, \({\mathsf {map}} ^1\) and \({\mathsf {map}} ^2\) output the neighbors of each node at a cost of O(M), and \({\mathsf {map}} ^3\) outputs each node with its compressed node at a cost of \(O(N\cdot \overline{|{\mathcal {S}}|})\), where \(\overline{|{\mathcal {S}}|}\) is the average size of the compressed nodes. In Algorithm 4, \({\mathsf {map}} _1^1\) outputs each node with its neighbors and \({\mathsf {map}} _2^1\) outputs the representative node with its compressed node; together they contribute \(O(M + N \cdot \overline{|{\mathcal {S}}|})\) cost. As for \({\mathsf {map}} _1^2\) and \({\mathsf {map}} _2^2\), we can simply use \(r_{{\mathcal {S}}(u)}\) to represent \({\mathcal {S}}(u)\); hence, they incur the same cost as the first stage. To summarize, the overall communication cost of constructing the compressed graph is \(O(M + N \cdot \overline{|{\mathcal {S}}|})\), or simply \(O(M + N)\), considering that \(\overline{|{\mathcal {S}}|}\) is often small. \(\square \)

Proof of Corollary 2 (Sect. 8.2)

Consider a match \((u_0, u_1, u_2)\) of p such that the corresponding compressed match \(({\mathcal {S}}(u_0), {\mathcal {S}}(u_1), {\mathcal {S}}(u_2))\) satisfies \({\mathcal {S}}(u_0) = {\mathcal {S}}\). For a valid match of p, we must have \((u_0, u_1) \in E(G)\) and \((u_0, u_2) \in E(G)\). There are four cases for the compressed match.

  • \({\mathcal {S}}= {\mathcal {S}}(u_0) = {\mathcal {S}}(u_1) = {\mathcal {S}}(u_2)\). In this case, \({\mathcal {S}}\) at least includes \(\{u_0, u_1, u_2\}\). Further, we have \({\mathcal {S}}.\)clique=true due to \((u_0, u_1) \in E(G)\). This compressed match is handled in line 2 in Algorithm 5.

  • \({\mathcal {S}}= {\mathcal {S}}(u_1)\) or \({\mathcal {S}}= {\mathcal {S}}(u_2)\). In this case, \({\mathcal {S}}\) has at least two nodes, and similarly \({\mathcal {S}}.\)clique=true. This compressed match is processed in line 4.

  • \({\mathcal {S}}\ne {\mathcal {S}}(u_1) = {\mathcal {S}}(u_2)\). Note that \({\mathcal {S}}(u_1) \in {\mathcal {N}}^*({\mathcal {S}})\), and this case is covered in line 5.

  • \({\mathcal {S}}\ne {\mathcal {S}}(u_1) \ne {\mathcal {S}}(u_2)\). Both compressed nodes are \({\mathcal {S}}\)’s neighbors. Algorithm 5 covers this case in line 7 by enumerating the pairs of compressed nodes in \({\mathcal {S}}\)’s compressed neighbors.

Summarizing the above cases, Algorithm 5 returns all \(R^*_{\mathcal {S}}(p)\). It is obvious that \(R^*(p) = \bigcup _{{\mathcal {S}}\in V(G^*)} R^*_{{\mathcal {S}}}(p)\). This completes the proof. \(\square \)

Proof of Lemma 7 (Sect. 8.2)

Following the pattern decomposition \({\mathcal {D}}(P) = \{p_0, p_1, \ldots , p_t\}\), the algorithm processes t rounds. We prove this lemma by induction on the MapReduce rounds.

The base case is round 0, where \(P_0\) is a \(\mathsf {Twin}\) \(\mathsf {Twig}\). The lemma holds since \({\mathsf {SubgEnumCompr}} \) correctly computes all compressed matches of a \(\mathsf {Twin}\) \(\mathsf {Twig}\) according to Corollary 2.

Suppose \({\mathsf {SubgEnumCompr}} \) correctly computes all compressed matches of \(P_{n - 1}\) in the \((n - 1)^{th}\) round, where \(1 < n \le t\). In the \(n^{th}\) round, \({\mathsf {SubgEnumCompr}} \) processes the join \(R^*(P_{n}) = R^*(P_{n - 1}) \bowtie R^*(p_n)\). Let the join attributes be \(V_k = V(P_{n - 1}) \cap V(p_n)\) and \(V(P_n) = (V(P_{n - 1}) \setminus V_k, V_k, V(p_n) \setminus V_k)\). Given a match f of \(P_n\), we divide it into three parts, namely \(f_{n - 1} = f(V(P_{n - 1}) \setminus V_k)\), \(f_k = f(V_k)\) and \(f_n = f(V(p_n) \setminus V_k)\), where \(f(V) = (f(v_1), f(v_2), \ldots )\) for all \(v_j \in V\). Define a mapping \(\sigma : V(G) \mapsto V(G^*)\) such that \(\sigma (u) = {\mathcal {S}}(u)\) for all \(u \in V(G)\). The compressed match related to f can hence be written as \(f \circ \sigma = (f_{n-1} \circ \sigma , f_k \circ \sigma , f_n \circ \sigma )\). It is obvious that \((f_{n-1} \circ \sigma , f_k \circ \sigma ) \in R^*(P_{n - 1})\) and \((f_k \circ \sigma , f_n \circ \sigma ) \in R^*(p_n)\). According to the induction hypothesis and Corollary 2, the algorithm correctly computes all of \(R^*(P_{n - 1})\) and \(R^*(p_n)\). Therefore, \((f_{n-1} \circ \sigma , f_k \circ \sigma )\) and \((f_k \circ \sigma , f_n \circ \sigma )\) must have been computed and will be joined in this round on the key \(f_k \circ \sigma \) to generate the compressed match of f. In other words, any compressed match in \(R^*(P_n)\) that is related to a valid match will be correctly computed.

By induction, \({\mathsf {SubgEnumCompr}} \) correctly computes all compressed matches of P after t rounds of MapReduce. \(\square \)
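One join round of \({\mathsf {SubgEnumCompr}} \), as used in the proof above, amounts to grouping both sides on the key \(f_k \circ \sigma \) and concatenating the matching pairs. Below is a minimal in-memory sketch of this shuffle-and-join step (illustrative only; the real algorithm runs in MapReduce, and the dictionary-based match representation is our own):

```python
# Hedged sketch of one join round in the proof of Lemma 7: compressed matches
# of P_{n-1} and of p_n are grouped on the join key (the mapping of the shared
# pattern nodes V_k) and concatenated, emulating the MapReduce shuffle.
from collections import defaultdict

def join_round(partial_matches, unit_matches, shared):
    """Each match is a dict pattern_node -> compressed data node;
    `shared` lists the join attributes V_k."""
    buckets = defaultdict(list)
    for m in partial_matches:
        buckets[tuple(m[v] for v in shared)].append(m)
    out = []
    for m in unit_matches:
        for left in buckets.get(tuple(m[v] for v in shared), []):
            out.append({**left, **m})   # concatenate the two sides
    return out
```

For example, joining the partial match `{"v0": "A", "v1": "B"}` with unit matches on the shared node `"v1"` keeps only the unit matches whose `"v1"` is also `"B"`.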


Cite this article

Lai, L., Qin, L., Lin, X. et al. Scalable subgraph enumeration in MapReduce: a cost-oriented approach. The VLDB Journal 26, 421–446 (2017). https://doi.org/10.1007/s00778-017-0459-4
