Graph pattern matching with counting quantifiers and label-repetition constraints

Cluster Computing

Abstract

In recent years, we have witnessed an increasing use of graph pattern matching in a wide variety of applications such as social network analysis, knowledge discovery, software plagiarism detection and many more. It is typically defined in terms of subgraph isomorphism, an NP-complete problem. To overcome this cost, many extensions of graph simulation have been proposed that allow graph pattern matching to be conducted in cubic time. However, emerging applications need more expressive patterns, notably ones with counting quantifiers (CQs), which simulation-based approaches do not consider. In this article, we propose a simulation-based graph pattern matching approach that supports CQs on edges of graph patterns. We first consider CQs that express numeric aggregates only. We show that our approach remains in PTIME, like earlier extensions of graph simulation, by providing a cubic-time quantified matching algorithm, i.e., an algorithm for matching graph patterns that contain CQs. In the second part, we discuss the problem of Label-Repetition Constraints (LRCs). We define a necessary and sufficient condition for the satisfaction of LRCs. Based on this condition, we extend our quantified matching algorithm to deal with LRCs in PTIME, together with an optimization technique. Finally, we show that our quantified graph pattern matching approach retains the same complexity bounds when dealing with ratio aggregates. To our knowledge, this is the first effort to deal with numeric aggregates, ratio aggregates, and LRCs on graph patterns in PTIME.

Notes

  1. \(M \oplus P = (M\setminus P)\cup (P\setminus M)\).

  2. It can be defined w.r.t S as explained in Sect. 2.

  3. These right-bounds are evaluated in a special order as we shall explain later.

  4. The restriction imposed in Sect. 1 (Remark 2) allows the definition of this notion.

  5. Recall that our QGPs respect the restriction given in Sect. 3.1 (Remark 1): their cycles contain at most one BCQ.

  6. First-order logic with Counting quantifiers.

  7. Names of persons are added for more clarification.

  8. I.e. \(\{(u,\downarrow ,l)\}\in \mathcal {LR}_P\) and \(\lambda _P(u_i)=l\) for \(i\in [1,k]\).

  9. I.e. \(\{(u,\uparrow ,l)\}\in \mathcal {LR}_P\) and \(\lambda _P(u_i)=l\) for \(i\in [1,k]\).

  10. The semantics of \(f_P(e)=[min\%,*]\) can be deduced similarly.

  11. The closest integer value greater than or equal to it.

  12. In practice, \(|V_P|\) is much smaller than \(|V|\), so \(|E_P|^{2}|V_P|\) is bounded by \(|E_P|^{2}|V|\), which is in turn bounded by \(|E_P||P||G|\).

References

  1. Agrawal, H.: Some generalizations of distinct representatives with applications to statistical designs. Ann. Math. Stat. 2, 525–528 (1966)

  2. Bapna, R., Umyarov, A.: Do your online friends make you pay? A randomized field experiment on peer influence in online social networks. Manag. Sci. 61(8), 1902–1920 (2015)

  3. Brynielsson, J., Högberg, J., Kaati, L., Mårtenson, C., Svenson, P.: Detecting social positions using simulation. In: ASONAM, pp. 48–55 (2010)

  4. Castelltort, A., Laurent, A.: Fuzzy historical graph pattern matching: a NoSQL graph database approach for fraud ring resolution. In: AIAI, pp. 151–167 (2015)

  5. Cho, J., Shivakumar, N., Garcia-Molina, H.: Finding replicated web collections. In: SIGMOD, pp. 355–366 (2000)

  6. Coffman, T., Greenblatt, S., Marcus, S.: Graph-based technologies for intelligence analysis. Commun. ACM 47(3), 45–47 (2004)

  7. Cong, G., Fan, W., Kementsietsidis, A.: Distributed query evaluation with performance guarantees. In: SIGMOD, pp. 509–520 (2007)

  8. Cordella, L.P., Foggia, P., Sansone, C., Vento, M.: A (sub)graph isomorphism algorithm for matching large graphs. IEEE Trans. Pattern Anal. Mach. Intell. 26, 1367–1372 (2004)

  9. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 3rd edn. MIT Press, Cambridge (2009)

  10. Fan, W.: Graph pattern matching revised for social network analysis. In: ICDT, pp. 8–21 (2012)

  11. Fan, W., Li, J., Ma, S., Tang, N., Wu, Y., Wu, Y.: Graph pattern matching: from intractable to polynomial time. VLDB Endow. 3, 264–275 (2010)

  12. Fan, W., Li, J., Ma, S., Tang, N., Wu, Y.: Adding regular expressions to graph reachability and pattern queries. In: ICDE, pp. 39–50 (2011)

  13. Fan, W., Li, J., Wang, X., Wu, Y.: Query preserving graph compression. In: SIGMOD, pp. 157–168 (2012)

  14. Fan, W., Wang, X., Wu, Y.: Answering pattern queries using views. IEEE Trans. Knowl. Data Eng. 28(2), 326–341 (2016)

  15. Fan, W., Wu, Y., Xu, J.: Adding counting quantifiers to graph patterns. In: SIGMOD, pp. 1215–1230 (2016)

  16. Francis, N., Green, A., Guagliardo, P., Libkin, L., Lindaaker, T., Marsault, V., Plantikow, S., Rydberg, M., Selmer, P., Taylor, A.: Cypher: An evolving query language for property graphs. In: SIGMOD, pp. 1433–1445. ACM, New York (2018)

  17. Grujic, I., Bogdanovic Dinic, S., Stoimenov, L.: Collecting and analyzing data from e-government facebook pages. In: Proceedings of ICT Innovations, pp. 86–96 (2014)

  18. Hall, P.: On representatives of subsets. Lond. Math. Soc. 10(1), 26–30 (1935)

  19. Blau, H., Immerman, N., Jensen, D.: A visual language for querying and updating graphs. Technical Report, University of Massachusetts (2002)

  20. Hopcroft, J.E., Karp, R.M.: An \(n^{5/2}\) algorithm for maximum matchings in bipartite graphs. SIAM J. Comput. 2(4), 225–231 (1973)

  21. Liu, C., Chen, C., Han, J., Yu, P.S.: Gplag: Detection of software plagiarism by program dependence graph analysis. In: SIGKDD, pp. 872–881 (2006)

  22. Liu, G., Zheng, K., Wang, Y., Orgun, M.A., Liu, A., Zhao, L., Zhou, X.: Multi-constrained graph pattern matching in large-scale contextual social graphs. In: ICDE, pp. 351–362 (2015)

  23. Ma, S., Cao, Y., Fan, W., Huai, J., Wo, T.: Strong simulation: capturing topology in graph pattern matching. ACM Trans. Database Syst. 39(1), 1–46 (2014)

  24. Ma, S., Li, J., Hu, C., Liu, X., Huai, J.: Graph pattern matching for dynamic team formation. CoRR arXiv:abs/1801.01012 (2018)

  25. Maccioni, A., Abadi, D.J.: Scalable pattern matching over compressed graphs via dedensification. In: SIGKDD, pp. 1755–1764 (2016)

  26. Mahfoud, H.: Graph pattern matching preserving label-repetition constraints. In: MEDI, pp. 268–281 (2018)

  27. Mennicke, S., Kalo, J., Balke, W.: Querying graph databases: What do graph patterns mean? In: ER, pp. 134–148 (2017)

  28. Milner, R.: Communication and Concurrency. Prentice-Hall, Inc., Upper Saddle River (1989)

  29. Onak, K., Rubinfeld, R.: Maintaining a large matching and a small vertex cover. In: STOC, pp. 457–464 (2010)

  30. Sankowski, P.: Faster dynamic matchings and vertex connectivity. In: SODA, pp. 118–126 (2007)

  31. Shemshadi, A., Sheng, Q.Z., Qin, Y.: Efficient pattern matching for graphs with multi-labeled nodes. Knowl. Based Syst. 109, 256–265 (2016)

  32. Tung, L.D., Nguyen-Van, Q., Hu, Z.: Efficient query evaluation on distributed graphs with hadoop environment. In: SoICT, pp. 311–319 (2013)

  33. Ullmann, J.R.: An algorithm for subgraph isomorphism. J. ACM 23(1), 31–42 (1976)

  34. Vasilyeva, E., Thiele, M., Bornhövd, C., Lehner, W.: Answering “why empty?” and “why so many?” Queries in graph databases. J. Comput. Syst. Sci. 82(1), 3–22 (2016)

Author information

Corresponding author

Correspondence to Houari Mahfoud.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

1.1 Appendix 1: Proof of Proposition 2

It takes \(O(|V_P|+|E_P|)\) time to extract the list L from P and to define the structure BCQs (lines 1–5). Starting at dependency level \(l=1\), for each edge \(e=(u,w)\) in L, we check whether neither w nor any of its descendants is concerned by a BCQ in P (test of line 9). If so, e does not depend on any edge in L, and thus it is added to the set \(\mathcal {DL}_P(l)\) (line 10). This classification process (lines 9–11) is done in \(O(|V_P|)\) time for a single edge, and in \(O(|E_P||V_P|)\) time for all edges in L (lines 8–12). Each time an edge \(e=(u,w)\) is classified, it is removed from the list L and the cardinality BCQs(u) is decremented. This process is done in \(O(|E_P|)\) time (lines 13–16) over all classified edges. Since the While loop is repeated at most \(|E_P|\) times, the overall time complexity of procedure DefineDL is \(O(|V_P|+|E_P|+|E_P|(|E_P||V_P| + |E_P|))\), which remains bounded by \(O(|P|+|E_P|^{2}|V_P|)\).
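
The last simplification can be spelled out as a short worked step. It assumes, as is usual but not restated in this appendix, that \(|P|=|V_P|+|E_P|\) and that \(|V_P|\ge 1\):

$$\begin{aligned} |V_P|+|E_P|+|E_P|\,(|E_P||V_P|+|E_P|)&= |P| + |E_P|^{2}|V_P| + |E_P|^{2} \\&\le |P| + 2\,|E_P|^{2}|V_P| = O(|P|+|E_P|^{2}|V_P|). \end{aligned}$$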

1.2 Appendix 2: Proof of Lemma 1

Consider a QGP \(P=(V_P,E_P,\lambda _P)\) and a data graph \(G=(V,E,\lambda )\). For each node \(u\in V_P\) and each potential match \(v\in V\) of u, we check whether v satisfies the CQs defined over children of u (lines 2–7) as well as the parent relationships defined over u (lines 8–13). This process takes \(O(|P^{\downarrow }(u)||V|)\) and \(O(|P^{\uparrow }(u)||V|)\) time respectively, by parsing only edges incident to u in E. By considering all nodes of P, the first part of the procedure (lines 2–13) takes \(O(|\bigcup _{u\in V_P} P^{\downarrow }(u)||V|+|\bigcup _{u\in V_P} P^{\uparrow }(u)||V|)\) time, which is bounded by \(O(|E_P||V|)\).

Recall that each time an incorrect match \((u,v)\) is deleted from \(S^{+}_{D}\), procedure UpdateStruct is called to update the structures \({\mathcal {M}}^{\downarrow }_{P,G}\) and \({\mathcal {M}}^{\uparrow }_{P,G}\). The cost of this procedure is \(O(|P^{\downarrow }(u)||G^{\downarrow }(v)|+|P^{\uparrow }(u)||G^{\uparrow }(v)|)\) time for each deleted match \((u,v)\). Since each node u may have at most |V| incorrect matches, the procedure requires at most \(O(\varSigma _{v\in V} (|P^{\downarrow }(u)||G^{\downarrow }(v)|+|P^{\uparrow }(u)||G^{\uparrow }(v)|))\) time to delete all possible incorrect matches of a node u in P, which is bounded by \(O(|P^{\downarrow }(u)||E|+|P^{\uparrow }(u)||E|)\) time. Therefore, the overall cost of procedure UpdateStruct, when considering all nodes in P, is bounded by \(O(|E_P||E|)\) time.

We assume that each incorrect match can be eliminated from \(S^{+}_{D}\) in O(1) time, which leads to \(O(|V_P||V|)\) time to delete all possible incorrect matches from this match relation.

For each couple \((u,v)\) deleted from \(S^{+}_{D}\), the second part of the procedure (lines 14–28) checks the correctness of only those matches in \(S^{+}_{D}\) that correspond to adjacent nodes of u. We first consider children of u that are influenced by the deletion of \((u,v)\) (i.e. each child \(u_c\) of u that is matched by a child \(v_c\) of v). Checking the correctness of the matches of these children (lines 16–21) takes at most \(O(|P^{\downarrow }(u)||G^{\downarrow }(v)|)\) time. Similarly, checking the correctness of the matches of parents of u that are influenced by the deletion of \((u,v)\) (i.e. each parent \(u_p\) of u that is matched by a parent \(v_p\) of v) takes at most \(O(|P^{\uparrow }(u)||G^{\uparrow }(v)|)\) time (lines 22–27). That is, for each couple \((u,v)\) deleted from \(S^{+}_{D}\), checking the correctness of the matches of adjacent nodes of u (one iteration of the While loop) takes at most \(O(|P^{\downarrow }(u)||G^{\downarrow }(v)|+|P^{\uparrow }(u)||G^{\uparrow }(v)|)\) time. By considering all possible incorrect matches of a single node u in P we obtain the following cost: \(O(|P^{\downarrow }(u)||\bigcup _{v\in V} G^{\downarrow }(v)|+|P^{\uparrow }(u)||\bigcup _{v\in V} G^{\uparrow }(v)|)\), which is bounded by \(O(|P^{\downarrow }(u)||E|+|P^{\uparrow }(u)||E|)\). We conclude that the refinement of \(S^{+}_{D}\), after deletion of all possible incorrect matches, takes at most \(O(|\bigcup _{u\in V_P} P^{\downarrow }(u)||E|+|\bigcup _{u\in V_P} P^{\uparrow }(u)||E|)\) time, which remains bounded by \(O(|E_P||E|)\).

Given the above, procedure Dual\(^{+}\) determines whether \(\varPi (P)\prec _Q G\) and finds the corresponding maximum match relation in at most \(O(|E_P||G|+|V_P||V|)\) time, which is bounded by O(|P||G|) time as stated in Lemma 1.
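
To make the refinement idea concrete, here is a minimal sketch of plain dual simulation computed as a naive fixpoint. It is not the paper's Dual\(^{+}\): it ignores CQs, does not use the indexed structures or the deletion queue analysed above, and therefore does not meet the O(|P||G|) bound. The function name `dual_simulation` and the input encoding (label dictionaries and labelled edge triples) are assumptions made for illustration.

```python
from collections import defaultdict

def dual_simulation(p_nodes, p_edges, g_nodes, g_edges):
    """Return the maximum dual-simulation relation as {u: set of v}, or None.

    p_nodes / g_nodes: dict mapping node -> label (hypothetical encoding).
    p_edges / g_edges: iterable of (source, target, edge_label) triples.
    """
    # Index data-graph neighbours by (node, edge label).
    g_children = defaultdict(set)
    g_parents = defaultdict(set)
    for v, w, lab in g_edges:
        g_children[(v, lab)].add(w)
        g_parents[(w, lab)].add(v)

    # Initial candidates: data nodes carrying the pattern node's label.
    sim = {u: {v for v, lab in g_nodes.items() if lab == p_nodes[u]}
           for u in p_nodes}

    changed = True
    while changed:
        changed = False
        for u, w, lab in p_edges:
            # Child condition: every match of u needs a child that matches w.
            bad_src = {v for v in sim[u] if not (g_children[(v, lab)] & sim[w])}
            # Parent condition: every match of w needs a parent that matches u.
            bad_dst = {v for v in sim[w] if not (g_parents[(v, lab)] & sim[u])}
            if bad_src or bad_dst:
                sim[u] -= bad_src
                sim[w] -= bad_dst
                changed = True

    # The pattern matches only if every pattern node keeps at least one match.
    if any(not matches for matches in sim.values()):
        return None
    return sim
```

The loop simply re-scans all pattern edges until nothing changes; the procedure analysed in this appendix instead propagates each deletion only to the adjacent nodes of the deleted match, which is where the \(O(|E_P||E|)\) refinement cost comes from.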

1.3 Appendix 3: Proof of Theorem 1

Consider a data graph \(G=(V,E,\lambda )\) and a QGP \(P=(V_P,E_P,\lambda _P,f_P)\). The match relation \(S^{+}_{D}\) is initialized in \(O(|V_P||V|)\) time. After that, procedure defineAuxStruct defines several auxiliary structures (\(G^{\uparrow }\), \(G^{\downarrow }\), \(P^{\uparrow }\), \(P^{\downarrow }\), \(P^{\Downarrow }\), \({\mathcal {M}}^{\uparrow }_{P,G}\), and \({\mathcal {M}}^{\downarrow }_{P,G}\)) in order to efficiently check whether \(P\prec _{Q} G\). The indexed structures \(G^{\downarrow }\) and \(G^{\uparrow }\) (resp. \(P^{\downarrow }\) and \(P^{\uparrow }\)) are constructed over the graph G (resp. P) in \(O(|V|+|E|)\) time (resp. \(O(|V_P|+|E_P|)\)), while \(P^{\Downarrow }\) is defined in \(O(|V_P||E_P|)\) time. The structures \({\mathcal {M}}^{\downarrow }_{P,G}\) and \({\mathcal {M}}^{\uparrow }_{P,G}\) are defined in \(O(|V_P||E|)\) time as follows. For each couple \((u,v)\in S_Q\) and each edge \(e=(v_p,v)\) in E with \(\lambda (e)=l\), we add v into the set \({\mathcal {M}}^{\downarrow }_{P,G}(v_p,u,l)\). Moreover, for each edge \(e=(v,v_c)\) in E with \(\lambda (e)=l\), we add v into the set \({\mathcal {M}}^{\uparrow }_{P,G}(v_c,u,l)\). This process requires \(O(|G^{\downarrow }(v)\cup G^{\uparrow }(v)|)\) time for each couple \((u,v)\), O(|E|) time for all possible matches of u, and thus \(O(|V_P||E|)\) time for all possible couples in \(S_Q\).
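
The construction of \({\mathcal {M}}^{\downarrow }_{P,G}\) and \({\mathcal {M}}^{\uparrow }_{P,G}\) described above can be sketched as follows. This is an illustrative reading of the paragraph rather than the paper's defineAuxStruct: the function name `build_match_indexes`, the dictionary-of-sets representation, and the edge-triple encoding are all assumptions.

```python
from collections import defaultdict

def build_match_indexes(candidates, g_edges):
    """Build the two match indexes, keyed by (data node, pattern node, label).

    candidates: dict mapping pattern node u -> set of data nodes v (the
                current match relation; hypothetical encoding).
    g_edges:    iterable of (source, target, edge_label) triples of G.
    """
    in_edges = defaultdict(list)   # v -> [(parent of v, edge label)]
    out_edges = defaultdict(list)  # v -> [(child of v, edge label)]
    for src, dst, lab in g_edges:
        out_edges[src].append((dst, lab))
        in_edges[dst].append((src, lab))

    m_down = defaultdict(set)  # (v_p, u, l) -> children of v_p that match u
    m_up = defaultdict(set)    # (v_c, u, l) -> parents of v_c that match u
    for u, matches in candidates.items():
        for v in matches:
            # Edge (v_p, v) labelled l: v is a child of v_p that matches u.
            for v_p, lab in in_edges[v]:
                m_down[(v_p, u, lab)].add(v)
            # Edge (v, v_c) labelled l: v is a parent of v_c that matches u.
            for v_c, lab in out_edges[v]:
                m_up[(v_c, u, lab)].add(v)
    return m_down, m_up
```

For a fixed pattern node u, the inner loops touch each edge of G at most twice, which matches the O(|E|) cost per pattern node stated above.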

The match relation \(S^{+}_{D}\) is refined in O(|P||G|) time w.r.t \(\varPi (P)\) (line 3). Next, the dependency levels function \(\mathcal {DL}_{P}\) is defined in \(O(|P|+|E_P|^{2}|V_P|)\) time (line 5) in order to refine \(S^{+}_{D}\) w.r.t the right-bound of each BCQ in P. For each edge \(e=(u,u^{\prime })\) in \(E_P\) that belongs to dependency level l, all potential matches of u are examined to check whether they satisfy the right-bound of \(f_P(e)\). Since each edge in P belongs to only one dependency level, this process is done only once for each edge in P. Thus, the For-Each loop (lines 9–13) is done in \(O(|E_P||V|)\) time. As shown in Appendix 2, procedure UpdateStruct takes at most \(O(|E_P||E|)\) time when considering the deletion of all possible incorrect matches in \(S_Q\). Since there may be at most \(|E_P|\) dependency levels in P, line 17 takes at most \(O(|E_P||P||G|)\) time. We conclude that the While loop (lines 7–20) takes at most \(O(|E_P||V|+|E_P||E|+|E_P||P||G|)\) time.

Finally, procedure Match\(^{Q}\) is called to construct the match result \({\mathcal {M}}_Q(P,G)\) w.r.t the final refined version of \(S_Q\). This construction takes at most \(O(|V_P||V|+|E_P||G|)\) time.

Given the above, the overall running time of algorithm Match\(^{Q}\) is essentially \(O(|E_P||P||G| + |E_P|^{2}|V_P|)\), which is bounded by \(O(|E_P||P||G|)\) (see Note 12).

1.4 Appendix 4: Proof of Theorem 2

Case of LRCs defined over children

Consider a data graph \(G=(V,E,\lambda )\), a QGP \(P=(V_P,E_P,\lambda _P,f_P)\), and an LRCs relation \(\mathcal {LR}_P\). Let \(u_1,\dots ,u_k\) (\(k\ge 2\)) be all the children of \(u\in V_P\) that are explicitly concerned by some LRC in \(\mathcal {LR}_P\). The bipartite graph that inspects these LRCs w.r.t a potential match \(v\in V\) of u is given by \(BG=(X\cup Y,E)\). Let \(M^{\downarrow }_{P,G}(v,u_i,\lambda _P(u,u_i))\) be the set of children of v in G that potentially match \(u_i\).

Based on the notions of Maximum Matching in Bipartite Graphs [9, 20] and General System of Distinct Representatives [1], the proof of Theorem 2 can be done in two steps (\(l_i\) is the left-bound of \(f_P(u,u_i)\) for \(i\in [1,k]\)):

  (a) The LRCs defined over children of u are satisfied by children of v iff the sets \(M^{\downarrow }_{P,G}(v,u_1,\lambda _P(u,u_1)),\dots ,M^{\downarrow }_{P,G}(v,u_k,\lambda _P(u,u_k))\) admit a GSDR with cardinalities \(l_1,\dots ,l_k\).

  (b) The sets \(M^{\downarrow }_{P,G}(v,u_1,\lambda _P(u,u_1)),\dots ,M^{\downarrow }_{P,G}(v,u_k,\lambda _P(u,u_k))\) admit a GSDR with cardinalities \(l_1,\dots ,l_k\) iff BG has an X-saturating matching.

(a) Suppose that the sets \(M^{\downarrow }_{P,G}(v,u_1,\lambda _P(u,u_1)),\dots ,M^{\downarrow }_{P,G}(v,u_k,\lambda _P(u,u_k))\) have a GSDR S with cardinalities \(l_1,\dots ,l_k\). This means that [1] \(S=\bigcup ^{k}_{i=1} S_i\) where \(S_i\subseteq M^{\downarrow }_{P,G}(v,u_i,\lambda _P(u,u_i))\) and \(|S_i|=l_i\) for \(i\in [1,k]\), and \(S_i\cap S_j = \emptyset\) for \(i\ne j\). Each \(S_i\) contains \(l_i\) distinct representatives of the set \(M^{\downarrow }_{P,G}(v,u_i,\lambda _P(u,u_i))\). It is clear that if S exists then we can match \(l_i\) distinct children of v to each child \(u_{1\le i\le k}\) of u. Moreover, given two children \(u_i\) and \(u_j\) of u (\(i\ne j, i,j\in [1,k]\)), the children of v that match \(u_i\) (subset \(S_i\)) are different from those that match \(u_j\) (subset \(S_j\)). Therefore, we can conclude that the children of v satisfy the LRCs defined over the children \(u_1,\dots ,u_k\) of u in P. The converse direction of equivalence (a) can be shown in a similar way.

(b) Recall that X contains \(l_i\) copies of each child \(u_i\) of u (\(i\in [1,k]\)), and Y contains each child of v that matches at least one child \(u_i\) of u (i.e. \(Y=\bigcup ^{k}_{i=1}M^{\downarrow }_{P,G}(v,u_i,\lambda _P(u,u_i))\)). An X-saturating matching over BG is a subset \(S\subseteq E\) such that: i) \(|S|=|X|\) and ii) no two edges in S share a vertex, in either X or Y. We first conclude from i) that \(|S|=\varSigma ^{k}_{i=1} l_i\), which is a necessary condition for the existence of the GSDR we are looking for. From ii), we conclude that each node in Y (some child of v) is mapped to only one node in X (some child of u). In other words, each copy of \(u_{1\le i\le k}\) in X is matched with one and only one node in Y. This means that the copies of \(u_i\) in X are matched by exactly \(l_i\) distinct nodes in Y. Moreover, given two children \(u_i\) and \(u_j\) of u (\(i\ne j, i,j\in [1,k]\)), the \(l_i\) nodes in Y that match copies of \(u_i\) in X are different from those that match copies of \(u_j\). Therefore, the X-saturating matching S forms a GSDR of cardinalities \(l_1,\ldots ,l_k\) over the sets \(M^{\downarrow }_{P,G}(v,u_1,\lambda _P(u,u_1)),\dots ,M^{\downarrow }_{P,G}(v,u_k,\lambda _P(u,u_k))\). The converse direction of equivalence (b) can be shown in a similar way.

From (a) and (b), we conclude the result of Theorem 2: The LRCs defined over children of u are satisfied by children of v iff the bipartite graph BG that inspects them has an X-saturating matching.
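
The reduction in this proof can be tried out on small inputs. The sketch below builds the set X with \(l_i\) copies of each child, uses the candidate sets as adjacency towards Y, and tests for an X-saturating matching. It uses the simple augmenting-path method (often called Kuhn's algorithm) rather than the Hopcroft–Karp algorithm [20] used by the paper, and the function name `has_gsdr` and the list-based encoding are assumptions.

```python
def has_gsdr(candidate_sets, left_bounds):
    """True iff left_bounds[i] elements can be picked from candidate_sets[i]
    for every i, with no element picked for two different sets."""
    # X: left_bounds[i] copies of index i. Y: the union of all candidate sets.
    x_nodes = [i for i, bound in enumerate(left_bounds) for _ in range(bound)]
    match_of_y = {}  # y -> position in x_nodes of the copy it is matched to

    def try_augment(x, seen):
        # Look for an augmenting path starting at copy x (depth-first).
        for y in candidate_sets[x_nodes[x]]:
            if y in seen:
                continue
            seen.add(y)
            if y not in match_of_y or try_augment(match_of_y[y], seen):
                match_of_y[y] = x
                return True
        return False

    matched = sum(try_augment(x, set()) for x in range(len(x_nodes)))
    # The matching is X-saturating exactly when every copy is matched.
    return matched == len(x_nodes)
```

For example, with candidate sets {1, 2, 3} and {3, 4} and left-bounds 2 and 2 the check succeeds (pick {1, 2} and {3, 4}), while left-bounds 3 and 2 fail because element 3 would be needed twice.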

Case of LRCs defined over parents

Consider now all the LRCs defined over parents of u, and let \(u_1,\ldots ,u_k\) (\(k\ge 2\)) be these parents. The proof of Theorem 2 can be done in two steps:

  (c) The LRCs defined over parents of u are satisfied by parents of v iff there exists an SSDR over the sets \(M^{\uparrow }_{P,G}(v,u_1,\lambda _P(u_1,u)),\dots ,M^{\uparrow }_{P,G}(v,u_k,\lambda _P(u_k,u))\).

  (d) The sets \(M^{\uparrow }_{P,G}(v,u_1,\lambda _P(u_1,u)),\dots ,M^{\uparrow }_{P,G}(v,u_k,\lambda _P(u_k,u))\) admit an SSDR iff BG has an X-saturating matching.

(c) Suppose that the sets \(M^{\uparrow }_{P,G}(v,u_1,\lambda _P(u_1,u)),\dots ,M^{\uparrow }_{P,G}(v,u_k,\lambda _P(u_k,u))\) have an SSDR S. This means that [18] S contains a distinct representative for each set \(M^{\uparrow }_{P,G}(v,u_i,\lambda _P(u_i,u))\). Precisely, \(S=\{v_1,\dots ,v_k\}\) where \(v_i\in M^{\uparrow }_{P,G}(v,u_i,\lambda _P(u_i,u))\) for \(i\in [1,k]\), and \(v_i\ne v_j\) for \(i\ne j\). It is clear that if S exists then we can match exactly one distinct parent of v (\(v_i\)) to each parent \(u_{1\le i\le k}\) of u. Moreover, given two parents \(u_i\) and \(u_j\) of u (\(i\ne j, i,j\in [1,k]\)), the parent of v that matches \(u_i\) is different from the one that matches \(u_j\). Therefore, we can conclude that v has at least k distinct parents that can satisfy the LRCs defined over the parents \(u_1,\dots ,u_k\) of u in P. The converse direction of equivalence (c) can be shown in a similar way.

(d) Recall that X contains each parent \(u_i\) of u (\(i\in [1,k]\)), and Y contains each parent of v that matches at least one parent \(u_i\) of u (i.e. \(Y=\bigcup ^{k}_{i=1}M^{\uparrow }_{P,G}(v,u_i,\lambda _P(u_i,u))\)). An X-saturating matching over BG is a subset \(S\subseteq E\) such that: i) \(|S|=|X|\) and ii) no two edges in S share a vertex, in either X or Y. We first conclude from i) that \(|S|=k\), which is a necessary condition for the existence of the SSDR we are looking for. From ii), we conclude that each node \(v_i\) in Y (some parent of v) is mapped to only one node \(u_i\) in X (some parent of u), and moreover, \(v_i\) is not mapped to any other node in X. Thus, \(v_i\) is naturally the representative of the set \(M^{\uparrow }_{P,G}(v,u_i,\lambda _P(u_i,u))\). Therefore, the X-saturating matching S forms an SSDR over the sets \(M^{\uparrow }_{P,G}(v,u_1,\lambda _P(u_1,u)),\dots ,M^{\uparrow }_{P,G}(v,u_k,\lambda _P(u_k,u))\). The converse direction of equivalence (d) can be shown in a similar way.

From (c) and (d), we conclude the result of Theorem 2: The LRCs defined over parents of u are satisfied by parents of v iff the bipartite graph BG that inspects them has an X-saturating matching.
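
For the parent case, where each set needs exactly one representative, the existence of such a system can also be cross-checked directly against Hall's condition [18]: every sub-family of j sets must cover at least j elements. The brute-force sketch below is exponential in k and is meant only as a sanity check on tiny inputs, not as a substitute for the matching-based test; the function name `satisfies_hall` is an assumption.

```python
from itertools import combinations

def satisfies_hall(candidate_sets):
    """Check Hall's condition over every non-empty sub-family of the sets."""
    k = len(candidate_sets)
    for size in range(1, k + 1):
        for family in combinations(candidate_sets, size):
            # A family of `size` sets must jointly contain >= `size` elements.
            if len(set().union(*family)) < size:
                return False
    return True
```

For instance, the family {1}, {2}, {1, 2} fails the condition (three sets cover only two elements), so no choice of three distinct parents exists, whereas {1, 2}, {2, 3}, {1, 3} passes.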

1.5 Appendix 5: Proof of Lemma 2

Consider a data graph \(G=(V,E,\lambda )\), a QGP \(P=(V_P,E_P,\lambda _P,f_P)\), and an LRCs relation \(\mathcal {LR}_P\). We first analyze the cost of procedure lrcsChecking that is given in Fig. 9. One can easily verify that the definition of the bipartite graphs \(BG^{c}\) (lines 4–8) and \(BG^{p}\) (lines 9–11) takes at most \(O(\ell |P^{\downarrow }(u)||G^{\downarrow }(v)|)\) time and \(O(|P^{\uparrow }(u)||G^{\uparrow }(v)|)\) time respectively. Next, we call procedure xsm to look for a maximum matching over \(BG^{c}\) (resp. \(BG^{p}\)). The latter is an implementation of the Hopcroft–Karp algorithm [20], which takes at most \(O(|E|\sqrt{|V|})\) time to find a maximum matching in a bipartite graph with |V| vertices and |E| edges. Therefore, procedure Xsm takes at most \(O(\ell |P^{\downarrow }(u)||G^{\downarrow }(v)|\sqrt{\ell |P^{\downarrow }(u)|+|G^{\downarrow }(v)|})\) (resp. \(O(|P^{\uparrow }(u)||G^{\uparrow }(v)|\sqrt{|P^{\uparrow }(u)|+|G^{\uparrow }(v)|})\)) time to find a maximum matching over the bipartite graph \(BG^{c}\) (resp. \(BG^{p}\)). Next, we check in constant time (lines 16–20) whether the cardinality of this maximum matching equals the size of the set \(X^{c}\) (resp. \(X^{p}\)).

Therefore, the overall cost of procedure lrcsChecking is given by \(O(\ell |P^{\downarrow }(u)||G^{\downarrow }(v)|\sqrt{\ell |P^{\downarrow }(u)|+|G^{\downarrow }(v)|} + |P^{\uparrow }(u)||G^{\uparrow }(v)|\sqrt{|P^{\uparrow }(u)|+|G^{\uparrow }(v)|})\) time.
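
For reference, a compact sketch of the Hopcroft–Karp algorithm [20] follows. It is a generic textbook-style version, not the paper's procedure: the function name `hopcroft_karp`, the integer vertex numbering, and the adjacency-list input are assumptions, and the recursive search is only suitable for small graphs in Python.

```python
from collections import deque

def hopcroft_karp(adj, n_left, n_right):
    """adj[x] lists the right-side vertices adjacent to left vertex x.
    Returns the size of a maximum matching."""
    INF = float("inf")
    match_l = [-1] * n_left   # match_l[x] = matched right vertex, or -1
    match_r = [-1] * n_right  # match_r[y] = matched left vertex, or -1
    dist = [0] * n_left

    def bfs():
        # Layer the left vertices by alternating-path distance from free ones.
        queue = deque()
        for x in range(n_left):
            if match_l[x] == -1:
                dist[x] = 0
                queue.append(x)
            else:
                dist[x] = INF
        found_free = False
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                nxt = match_r[y]
                if nxt == -1:
                    found_free = True  # an augmenting path exists
                elif dist[nxt] == INF:
                    dist[nxt] = dist[x] + 1
                    queue.append(nxt)
        return found_free

    def dfs(x):
        # Follow the BFS layers to a free right vertex and flip the path.
        for y in adj[x]:
            nxt = match_r[y]
            if nxt == -1 or (dist[nxt] == dist[x] + 1 and dfs(nxt)):
                match_l[x] = y
                match_r[y] = x
                return True
        dist[x] = INF  # dead end: prune x for the rest of this phase
        return False

    size = 0
    while bfs():
        for x in range(n_left):
            if match_l[x] == -1 and dfs(x):
                size += 1
    return size
```

In the setting of this appendix, the LRC check would succeed exactly when the returned size equals the number of left vertices, i.e. when the maximum matching is X-saturating.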

By the proof of Lemma 1, we have seen that the old version of Dual\(^{+}\) takes at most O(|P||G|) time. The overall time complexity of the new version given in Fig. 8 can be determined by computing only the cost of the extension represented by the four blocks.

Block 1

It takes at most \(O(|V_P|+|E_P|)\) time to define the sets DC(u) and DP(u) for all nodes u in P (lines 1–9).

Block 2

For each node \(u\in V_P\) and each match v of u, procedure lrcsChecking is called to check the satisfaction of LRCs defined over children and/or parents of u (lines 21–25). To simplify the analysis, we bound \(\sqrt{\ell |P^{\downarrow }(u)|+|G^{\downarrow }(v)|}\) by \(\sqrt{\ell |V_P|+|V|}\), since each node \(u\in V_P\) (resp. \(v\in V\)) may have at most \(|V_P|\) (resp. |V|) children in the case of a dense pattern (resp. data) graph. Similarly we bound \(\sqrt{|P^{\uparrow }(u)|+|G^{\uparrow }(v)|}\) by \(\sqrt{|V_P|+|V|}\). Since each node in P may have at most |V| possible matches in G, checking whether LRCs defined over u are satisfied by each possible match of u takes at most:

$$\begin{aligned} \varSigma ^{k}_{i=1} O(\ell |P^{\downarrow }(u)||G^{\downarrow }(v_i)|\sqrt{\ell |V_P|+|V|} + |P^{\uparrow }(u)||G^{\uparrow }(v_i)|\sqrt{|V_P|+|V|}){\text{ time}}\end{aligned}$$

where \(v_{1\le i\le k}\) is any match of u in \(S^{+}_D\). It is clear that \(\varSigma ^{k}_{i=1}|G^{\downarrow }(v_i)|\) (resp. \(\varSigma ^{k}_{i=1}|G^{\uparrow }(v_i)|\)) is bounded by |E|. Then, checking LRCs over all possible matches of a single node u takes at most:

$$\begin{aligned} O(\ell |P^{\downarrow }(u)||E|\sqrt{\ell |V_P|+|V|} + |P^{\uparrow }(u)||E|\sqrt{|V_P|+|V|})\text { time} \end{aligned}$$

By taking into account all nodes of P, we obtain the following cost:

$$\begin{aligned} \varSigma ^{k}_{i=1} O(\ell |P^{\downarrow }(u_i)||E|\sqrt{\ell |V_P|+|V|} + |P^{\uparrow }(u_i)||E|\sqrt{|V_P|+|V|}) \end{aligned}$$

where \(u_{1\le i\le k}\) is any node in P whose children and/or parents are concerned by some LRCs. Since \(\varSigma ^{k}_{i=1}|P^{\downarrow }(u_i)|\) (resp. \(\varSigma ^{k}_{i=1}|P^{\uparrow }(u_i)|\)) is bounded by \(|E_P|\), the previous cost is still bounded by:

$$\begin{aligned} O(\ell |E_P||E|\sqrt{\ell |V_P|\!+\!|V|}\!+\!|E_P||E|\sqrt{|V_P|+|V|})=O(\ell |E_P||E|\sqrt{\ell |V_P|\!+\!|V|}) \end{aligned}$$

which is the overall time complexity of block 2.

Block 3

Each time an incorrect match \((u,v)\) is removed from the match relation \(S^{+}_D\), we check whether children of v satisfy LRCs defined over children of u. Precisely, for each child \(u_c\) of u that is matched by a child \(v_c\) of v, if there is an LRC defined over parents of \(u_c\) including u (i.e. \((u_c,\uparrow ,\lambda _P(u))\in \mathcal {LR}_P\)), then we check whether \(v_c\) still satisfies this LRC (lines 32–34) after deleting the match \((u,v)\). By using procedure lrcsChecking, this process takes at most \(O(|P^{\uparrow }(u_c)||G^{\uparrow }(v_c)|\sqrt{|V_P|+|V|})\) time. The For-Each loop (lines 28–35) repeats this process over all possible children of u and their possible matches, and requires at most:

$$\begin{aligned} \varSigma _{u_c\in P^{\downarrow }(u)} \varSigma _{v_c\in G^{\downarrow }(v)}O(|P^{\uparrow }(u_c)||G^{\uparrow }(v_c)|\sqrt{|V_P|+|V|})\text { time} \end{aligned}$$

This cost is bounded by \(O(|E_P||E|\sqrt{|V_P|+|V|})\) time and represents the overall cost of block 3.

Block 4

Similarly to block 3, for each parent \(u_p\) of u that is matched by a parent \(v_p\) of v, if there is an LRC defined over children of \(u_p\) including u (i.e. \((u_p,\downarrow ,\lambda _P(u))\in \mathcal {LR}_P\)), then we check whether \(v_p\) still satisfies this LRC (lines 40–42) after deleting the match \((u,v)\). By using procedure lrcsChecking, this process takes at most \(O(\ell |P^{\downarrow }(u_p)||G^{\downarrow }(v_p)|\sqrt{\ell |V_P|+|V|})\) time. The For-Each loop (lines 36–43) repeats this process over all possible parents of u and their possible matches, and requires at most:

$$\begin{aligned} \varSigma _{u_p\in P^{\uparrow }(u)} \varSigma _{v_p\in G^{\uparrow }(v)} O(\ell |P^{\downarrow }(u_p)||G^{\downarrow }(v_p)|\sqrt{\ell |V_P|+|V|})\text { time} \end{aligned}$$

This cost is bounded by \(O(\ell |E_P||E|\sqrt{\ell |V_P|+|V|})\) time and represents the overall cost of block 4.

Since there may be at most \(|V_P||V|\) incorrect matches in \(S^{+}_D\), blocks 3 and 4 require at most \(O(|V_P||V||E_P||E|\sqrt{|V_P|+|V|})\) and \(O(\ell |V_P||V||E_P||E|\sqrt{\ell |V_P|+|V|})\) time respectively.

In summary, extensions made over the old version of procedure Dual\(^{+}\) require the following costs:

  • \(O(|V_P|+|E_P|)\) time for block 1;

  • \(O(\ell |E_P||E|\sqrt{\ell |V_P|+|V|})\) time for block 2;

  • \(O(|V_P||V||E_P||E|\sqrt{|V_P|+|V|})\) time for block 3;

  • \(O(\ell |V_P||V||E_P||E|\sqrt{\ell |V_P|+|V|})\) time for block 4.

Therefore, the extended version of procedure Dual\(^{+}\) (Fig. 8) takes at most \(O(|P||G|+\ell |V_P||V||E_P||E|\sqrt{\ell |V_P|+|V|})\) time as stated by Lemma 2.

1.6 Appendix 6: Proof of Theorem 3

The proof is the same as that of Theorem 1, except that the cost of procedure Dual\(^{+}\) is given by Lemma 2.

1.7 Appendix 7: Proof of Lemma 3

The preprocessing cost is equivalent to that of the old version of procedure lrcsChecking (Fig. 9), as shown in Appendix 5.

The cost of the maintenance process (lines 2–19) is detailed as follows. The set \(S^{del}\) can be computed in \(O(|P^{\downarrow }(u)||G^{\downarrow }(v)|+|P^{\uparrow }(u)||G^{\uparrow }(v)|)\) time by combining each child (resp. parent) \(u^{\prime }\) of u with each child (resp. parent) \(v^{\prime }\) of v and checking whether this couple \((u^{\prime },v^{\prime })\) belonged to \(S^{old}\) and does not belong to S. The maintenance of the previously defined \(BG^{c}\) (lines 6–10) and \(BG^{p}\) (lines 11–13) requires \(O(\ell |P^{\downarrow }(u)||G^{\downarrow }(v)|)\) and \(O(|P^{\uparrow }(u)||G^{\uparrow }(v)|)\) time respectively.

If \(k_c=|M^{c}|-|M^{c}\cap E^{c}|\) and \(k_p=|M^{p}|-|M^{p}\cap E^{p}|\), then there may be at most \(k_c\) (resp. \(k_p\)) augmenting paths over the updated \(BG^{c}\) (resp. \(BG^{p}\)). Procedure MaintainXsm requires one iteration to find each of these paths. Thus, the whole maintenance of the previously computed maximum matching \(M^{c}\) (lines 14–16) takes at most \(O(k_c|E^{c}|)\) time, which is bounded by \(O(k_c\ell |P^{\downarrow }(u)||G^{\downarrow }(v)|)\). The maintenance of the previously computed maximum matching \(M^{p}\) (lines 17–19) is done similarly in at most \(O(k_p|E^{p}|)\) time, which is bounded by \(O(k_p|P^{\uparrow }(u)||G^{\uparrow }(v)|)\).

Hence, the overall time complexity of the maintenance process (lines 2–19) is bounded by \(O(k_c\ell |P^{\downarrow }(u)||G^{\downarrow }(v)|+k_p|P^{\uparrow }(u)||G^{\uparrow }(v)|)\) time.
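
The maintenance idea can be sketched as follows: keep the part of the old matching whose edges survive the update, then search one augmenting path for each left vertex that lost its partner. This is an illustrative simplification that handles edge deletions only, not the paper's MaintainXsm; the function name `maintain_matching` and the dictionary encoding are assumptions.

```python
def maintain_matching(old_match, adj):
    """Repair a matching after edges were deleted from the bipartite graph.

    old_match: dict x -> y, the previously computed maximum matching.
    adj:       dict x -> set of y, the edges that survive the update
               (every left vertex x must appear as a key).
    Returns a maximum matching of the updated graph as a dict x -> y.
    """
    # Keep only matched pairs whose edge still exists (the set M ∩ E).
    match_x = {x: y for x, y in old_match.items() if y in adj.get(x, ())}
    match_y = {y: x for x, y in match_x.items()}

    def try_augment(x, seen):
        # One depth-first search; `seen` caps its cost at O(|E|).
        for y in adj[x]:
            if y in seen:
                continue
            seen.add(y)
            if y not in match_y or try_augment(match_y[y], seen):
                match_x[x] = y
                match_y[y] = x
                return True
        return False

    # One search per left vertex that is currently unmatched.
    for x in adj:
        if x not in match_x:
            try_augment(x, set())
    return match_x
```

When the old matching was X-saturating, the number of searches equals the number of matched edges that were deleted, which corresponds to \(k_c\) (resp. \(k_p\)) in the analysis above; the LRC still holds after the update exactly when the repaired matching again covers every left vertex.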

About this article

Cite this article

Mahfoud, H. Graph pattern matching with counting quantifiers and label-repetition constraints. Cluster Comput 23, 1529–1553 (2020). https://doi.org/10.1007/s10586-019-02977-3

  • DOI: https://doi.org/10.1007/s10586-019-02977-3
