Fast diversified coherent core search on multi-layer graphs

  • Regular Paper
The VLDB Journal

Abstract

Mining dense subgraphs on multi-layer graphs is an interesting problem with many practical applications. To overcome the limitations of the quasi-clique-based approach, we propose the d-coherent core (d-CC), a new notion of dense subgraph on multi-layer graphs, which has several elegant properties. We formalize the diversified coherent core search (DCCS) problem, which finds k d-CCs that cover the largest number of vertices. We propose a greedy algorithm with an approximation ratio of \(1 - 1/e\) and two search algorithms with an approximation ratio of 1/4. Furthermore, we propose several optimization techniques to further speed up the algorithms. The experiments verify that the search algorithms are faster than the greedy algorithm and produce results comparable to those of the greedy algorithm in practice. In contrast to the quasi-clique-based approach, our DCCS algorithms can quickly detect larger dense subgraphs that cover most of the quasi-clique-based results.

Notes

  1. We do not need to consider \(I_0\) since vertices in \(I_0\) are not in the d-core on any layer of the multi-layer graph \({\mathcal {G}}\).

  2. http://string-db.org.

  3. http://cn.aminer.org.

  4. http://konect.uni-koblenz.de.

  5. http://snap.stanford.edu.

  6. http://mips.helmholtz-muenchen.de.

References

  1. Abdel-Rahim, A., Oman, P., Johnson, B.K., Sadiq, R.A.: Assessing surface transportation network component criticality: a multi-layer graph-based approach. In: IEEE Intelligent Transportation Systems Conference, pp. 1000–1003 (2007)

  2. Ausiello, G., Boria, N., Giannakos, A., Lucarelli, G., Paschos, V.T.: Online maximum k-coverage. In: International Conference on Fundamentals of Computation Theory, pp. 181–192 (2011)

  3. Batagelj, V., Zaversnik, M.: An O(m) algorithm for cores decomposition of networks. Comput. Sci. 1(6), 34–37 (2003)

  4. Bilbro, G.L.: Solution of the recirculant multilayer graph problem using compensated simulated annealing. In: Proceedings of SPIE, the International Society for Optical Engineering, vol. 1766 (1992)

  5. Boden, B., Günnemann, S., Hoffmann, H., Seidl, T.: Mining coherent subgraphs in multi-layer graphs with edge labels. In: KDD, pp. 1258–1266 (2012)

  6. Bogue, E.T., de Souza, C.C., Xavier, E.C., Freire, A.S.: An integer programming formulation for the maximum k-subset intersection problem. In: Lecture Notes in Computer Science, vol. 8596, pp. 87–99 (2014)

  7. Chakraborty, T., Narayanam, R.: Cross-layer betweenness centrality in multiplex networks with applications. In: ICDE, pp. 397–408 (2016)

  8. Chuang, J.R., Lin, J.M.: Efficient multi-layer obstacle-avoiding preferred direction rectilinear Steiner tree construction. In: Asia and South Pacific Design Automation Conference, pp. 527–532 (2011)

  9. David, C.W.: Stirling’s Approximation. Betascript Publishing, Saarbrücken (2007)

  10. Dong, X., Frossard, P., Vandergheynst, P., Nefedov, N.: Clustering with multi-layer graphs: a spectral perspective. IEEE Trans. Signal Process. 60(11), 5820–5831 (2011)

  11. Fang, Y., Zhang, H., Ye, Y., Li, X.: Detecting hot topics from Twitter: a multiview approach. J. Inf. Sci. 40(5), 578–593 (2014)

  12. Frickey, T., Weiller, G.: Mclip: motif detection based on cliques of gapped local profile-to-profile alignments. Bioinformatics 23(4), 502–503 (2007)

  13. Hu, H., Yan, X., Huang, Y., Han, J., Zhou, X.J.: Mining coherent dense subgraphs across massive biological networks for functional discovery. Bioinformatics 21(suppl-1), i213 (2005)

  14. Kim, J., Lee, J.G.: Community detection in multi-layer graphs: a survey. ACM SIGMOD Record 44(3), 37–48 (2015)

  15. Kivelä, M., Arenas, A., Barthelemy, M., Gleeson, J.P., Moreno, Y., Porter, M.A.: Multilayer networks. J. Complex Netw. 2(3), 203–271 (2014)

  16. Lee, V.E., Ruan, N., Jin, R., Aggarwal, C.C.: A survey of algorithms for dense subgraph discovery. In: Aggarwal, C.C., Wang, H. (eds.) Managing and Mining Graph Data, pp. 303–336. Springer, New York (2010)

  17. Li, H., Nie, Z., Lee, W.C., Giles, L., Wen, J.R.: Scalable community discovery on textual data with relations. In: CIKM, pp. 1203–1212 (2008)

  18. Li, R.H., Qin, L., Yu, J.X., Mao, R.: Influential community search in large networks. PVLDB 8(5), 509–520 (2015)

  19. Liu, J., Wang, C., Gao, J., Han, J.: Multi-view clustering via joint nonnegative matrix factorization. In: SDM, pp. 252–260 (2013)

  20. Pei, J., Jiang, D., Zhang, A.: On mining cross-graph quasi-cliques. In: KDD, pp. 228–238 (2005)

  21. Qi, G.J., Aggarwal, C.C., Huang, T.: Community detection with edge content in social media networks. In: ICDE, pp. 534–545 (2012)

  22. Ruan, Y., Fuhry, D., Parthasarathy, S.: Efficient community detection in large networks using content and links. In: WWW, pp. 1089–1098 (2012)

  23. Sanjeev, K., Gilbert, H.: Multilayer Networks. Wiley, Hoboken (2011)

  24. Silva, A., Meira Jr., W., Zaki, M.J.: Mining attribute-structure correlated patterns in large attributed graphs. PVLDB 5(5), 466–477 (2012)

  25. Solé-Ribalta, A., De Domenico, M., Gómez, S., Arenas, A.: Centrality rankings in multiplex networks. In: Proceedings of the 2014 ACM Conference on Web Science, pp. 149–155. ACM (2014)

  26. Sun, Y., Yu, Y., Han, J.: Ranking-based clustering of heterogeneous information networks with star network schema. In: KDD, pp. 797–806 (2009)

  27. Szklarczyk, D., Morris, J.H., Cook, H., Kuhn, M., Wyder, S., Simonovic, M., Santos, A., Doncheva, N.T., Roth, A., Bork, P.: The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res. 45(Database), D362–D368 (2017)

  28. Tang, W., Lu, Z., Dhillon, I.S.: Clustering with multiple graphs. In: ICDM, pp. 1016–1021 (2009)

  29. Xu, Z., Ke, Y., Wang, Y., Cheng, H., Cheng, J.: A model-based approach to attributed graph clustering. In: SIGMOD, pp. 505–516 (2012)

  30. Yan, X., Han, J.: gSpan: graph-based substructure pattern mining. In: ICDM, pp. 721–724 (2002)

  31. Yang, Y., Yan, D., Wu, H., Cheng, J., Zhou, S., Lui, J.C.S.: Diversified temporal subgraph pattern mining. In: KDD, pp. 1965–1974 (2016)

  32. Zeng, Z., Wang, J., Zhou, L., Karypis, G.: Coherent closed quasi-clique discovery from large dense graph databases. In: KDD, pp. 797–802 (2006)

  33. Zhang, J., Kong, X., Yu, P.S.: Predicting social links for new users across aligned heterogeneous social networks. In: ICDM, pp. 1289–1294 (2013)

  34. Zhou, P., Miao, G., Bing, B.: Cross-layer congestion control and scheduling in multi-hop OFDMA wireless networks. In: IEEE Conference on Global Telecommunications, pp. 1–6 (2009)

  35. Zhou, Y., Cheng, H., Yu, J.X.: Graph clustering based on structural/attribute similarities. PVLDB 2(1), 718–729 (2009)

Acknowledgements

This work was partially supported by the National Natural Science Foundation of China under Grant Nos. 61672189, 61532015, and 61732003.

Author information

Corresponding author

Correspondence to Zhaonian Zou.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Missing proofs

Proof

(Property 1) Suppose \(C^{d}_{L}({\mathcal {G}})\) is not unique. Let \(C_1\) and \(C_2\) be two distinct d-CCs of \({\mathcal {G}}\) w.r.t. L, and let \(C = C_1 \cup C_2\). We have \(C_1 \subseteq C\) and \(C_2 \subseteq C\), and since \(C_1 \ne C_2\), at least one of these inclusions is proper. On each layer \(i \in L\), \(G_i[C_1]\) is a subgraph of \(G_i[C]\). Thus, for each vertex \(v \in C_1\), we have \(d_{G_i[C]}(v) \ge d_{G_i[C_1]}(v) \ge d\) for all \(i \in L\); the same argument applies to each vertex \(v \in C_2\). Hence, C also satisfies the degree constraint of a d-CC and properly contains \(C_1\) or \(C_2\), so \(C_1\) and \(C_2\) cannot both be maximal, which leads to a contradiction. Thus, \(C^{d}_{L}({\mathcal {G}})\) is unique. \(\square \)

Proof

(Property 2) Let \(d_1, d_2 \in {\mathbb {N}}\) and \(d_1 > d_2\). For each vertex \(v \in C^{d_1}_{L}({\mathcal {G}})\), we have \(d_{G_l[C^{d_1}_{L}({\mathcal {G}})]}(v) \ge d_1 > d_2\) for every layer number \(l \in L\). By the definition of d-CC, \(C^{d_1}_{L}({\mathcal {G}}) \subseteq C^{d_2}_{L}({\mathcal {G}})\). Thus, the property holds. \(\square \)

Proof

(Property 3) Let \(L \subseteq L^{\prime }\). For each vertex \(v \in C^{d}_{L^{\prime }}({\mathcal {G}})\), we have \(d_{G_l[C^{d}_{L^{\prime }}({\mathcal {G}})]}(v) \ge d\) for each layer number \(l \in L^{\prime }\), and hence for each layer number \(l \in L\). Based on the definition of d-CC, we have \(C^{d}_{L^{\prime }}({\mathcal {G}}) \subseteq C^{d}_{L}({\mathcal {G}})\). Hence, the property holds. \(\square \)
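The properties above can be checked on small inputs with a simple peeling routine. The following is a minimal sketch, not the procedure used in the paper: it assumes each layer is given as an undirected adjacency dictionary, and the names `make_layer` and `d_coherent_core` are hypothetical. It repeatedly removes any vertex whose degree drops below d on some layer in L, which leaves the unique maximal vertex set satisfying the degree constraint used in the proofs.

```python
from collections import deque

def make_layer(edges):
    """Build a symmetric adjacency dict {vertex: set(neighbours)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def d_coherent_core(layers, layer_ids, d):
    """Return the d-CC w.r.t. layer_ids (assumed representation: layers[i] is an adjacency dict)."""
    # Start from every vertex that appears on any layer.
    alive = set()
    for adj in layers.values():
        alive.update(adj)
    # deg[i][v]: degree of v on layer i inside the subgraph induced by `alive`.
    deg = {i: {v: len(layers[i].get(v, ())) for v in alive} for i in layer_ids}
    queue = deque(v for v in alive if any(deg[i][v] < d for i in layer_ids))
    queued = set(queue)
    while queue:
        v = queue.popleft()
        alive.discard(v)
        # Removing v lowers the degree of its surviving neighbours on every layer in L.
        for i in layer_ids:
            for u in layers[i].get(v, ()):
                if u in alive:
                    deg[i][u] -= 1
                    if deg[i][u] < d and u not in queued:
                        queued.add(u)
                        queue.append(u)
    return alive
```

On a toy two-layer graph, the result for a larger layer set is contained in the result for a smaller one (Property 3), and raising d can only shrink the result (Property 2).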

Proof

(Lemma 1) It is clear that \(L_1 \subseteq L_1 \cup L_2\) and \(L_2 \subseteq L_1 \cup L_2\). By Property 3, we have \(C_{L_1 \cup L_2}^{d} ({\mathcal {G}}) \subseteq C_{L_1}^{d} ({\mathcal {G}})\) and \(C_{L_1 \cup L_2}^{d} ({\mathcal {G}}) \subseteq C_{L_2}^{d} ({\mathcal {G}})\). Thus, \(C_{L_1 \cup L_2}^{d} ({\mathcal {G}}) \subseteq C_{L_1}^{d} ({\mathcal {G}}) \cap C_{L_2}^{d}({\mathcal {G}})\). \(\square \)

Proof

(Lemma 2) Let \(C^d_{L^{\prime }}({\mathcal {G}})\) be a descendant of \(C^d_{L}({\mathcal {G}})\). We have \(L \subseteq L^{\prime }\). By Property 3, we have \(C^d_{L^{\prime }}({\mathcal {G}}) \subseteq C^d_{L}({\mathcal {G}})\). Thus, \(\textsf {Cov}(({\mathcal {R}}- \{C^{*}({\mathcal {R}})\}) \cup \{C^d_{L^{\prime }}({\mathcal {G}})\}) \subseteq \textsf {Cov}(({\mathcal {R}}- \{C^{*}({\mathcal {R}})\}) \cup \{C^d_{L}({\mathcal {G}})\})\). Obviously, if we have \(|\textsf {Cov}(({\mathcal {R}}- \{C^{*}({\mathcal {R}})\}) \cup \{C^d_{L}({\mathcal {G}})\})| < \left( 1 + \frac{1}{k}\right) |\textsf {Cov}({\mathcal {R}})|\), we must have \(|\textsf {Cov}(({\mathcal {R}}- \{C^{*}({\mathcal {R}})\}) \cup \{C^d_{L^{\prime }}({\mathcal {G}})\})| < \left( 1 + \frac{1}{k}\right) |\textsf {Cov}({\mathcal {R}})|\). Thus, \(C^d_{L^{\prime }}({\mathcal {G}})\) cannot satisfy Eq. (1), which means none of the descendants of \(C^d_{L}({\mathcal {G}})\) satisfies Eq. (1). \(\square \)

Proof

(Lemma 3) For any subset \(D \subseteq L_P\) such that \(|D| = s - |L|\), since \(I_{D} = \cap _{i \in D} C^{d}(G_i)\), we have \(I_D \subseteq C^{d}(G_i)\) for each \(i \in D\). Consequently, for each vertex \(v \in I_D\), we have \(v \in C^{d}(G_i)\) for each \(i \in D\). That is, v must be contained in at least \(s - |L|\) d-cores on the layers in \(L_P\), so \(v \in I\). Therefore, we have \(I_D \subseteq I\). \(\square \)
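As an illustration of the set I used in this proof, the sketch below counts, for each vertex, how many d-cores on the layers in \(L_P\) contain it and keeps the vertices whose count reaches a threshold. The function name `high_support_vertices` and the dictionary representation of the per-layer d-cores are assumptions made for this example; the threshold plays the role of \(s - |L|\) and is assumed to be at least 1.

```python
from collections import Counter

def high_support_vertices(d_cores, layer_pool, threshold):
    """Vertices contained in at least `threshold` of the d-cores on the layers in layer_pool.

    d_cores maps a layer id to the vertex set of the d-core on that layer
    (an assumed input format for this sketch).
    """
    support = Counter()
    for i in layer_pool:
        support.update(d_cores[i])  # each d-core contributes 1 to each of its vertices
    return {v for v, count in support.items() if count >= threshold}
```

For any choice of `threshold` layers from the pool, the intersection of their d-cores is a subset of the returned set, which is the containment \(I_D \subseteq I\) stated in the lemma.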

Proof

(Lemma 4) Let \(C^d_{S}({\mathcal {G}})\) be a descendant of \(C^d_{L}({\mathcal {G}})\) such that \(|S| = s\). Let \(D = S - L\) and \(I_D = \cap _{i \in D} C^{d}(G_i)\). By Lemma 1, we have \(C^d_{S}({\mathcal {G}}) \subseteq C^d_{L}({\mathcal {G}}) \cap I_D\). Since \(I_D \subseteq I\) according to Lemma 3, we have \(C^d_{S}({\mathcal {G}}) \subseteq C^d_{L}({\mathcal {G}}) \cap I\). For ease of presentation, let \(C = C_{L}^d({\mathcal {G}}) \cap I\). We illustrate the relationships between \(\textsf {Cov}({\mathcal {R}})\), \(C^{*}({\mathcal {R}})\) and C in Fig. 40 with seven disjoint subsets A, B, D, E, F, G, and H. We have \( |\textsf {Cov}({\mathcal {R}})| = |A| + |B| + |D| + |F| + |G| + |H|, |C^{*}({\mathcal {R}})| = |B| + |D| + |G| + |H|, |C| = |D| + |E| + |F| + |G|, |{\varDelta }({\mathcal {R}}, C^*({\mathcal {R}}))| = |D| + |H|\).

Since \(|C| < \frac{1}{k}|\textsf {Cov}({\mathcal {R}})| + |{\varDelta }({\mathcal {R}}, C^*({\mathcal {R}}))|\), we have \(|D| +|E| + |F| + |G| < \frac{1}{k} (|A| + |B| + |D| + |F| + |G| + |H|) + |D| + |H|\). Thus,

$$\begin{aligned} \begin{aligned}&|\textsf {Cov}(({\mathcal {R}}- \{C^{*}({\mathcal {R}})\}) \cup \{C\})| \\&\quad = |A| + |B| + |D| + |E| + |F| + |G| \\&\quad < \frac{1}{k} (|A| + |B| + |D| + |G| + |F| + |H|) + |A|\\&\quad + |B| + |D| + |H| \\&\quad \le \left( 1 + \frac{1}{k}\right) (|A| + |B| + |D| + |G| + |F| + |H|) \\&\quad = \left( 1 + \frac{1}{k}\right) |\textsf {Cov}({\mathcal {R}})|. \end{aligned} \end{aligned}$$

Since \(C^d_{S}({\mathcal {G}}) \subseteq C\), we have \(\textsf {Cov}(({\mathcal {R}}- \{C^{*}({\mathcal {R}})\}) \cup \{C^{d}_{S}({\mathcal {G}})\}) \subseteq \textsf {Cov}(({\mathcal {R}}- \{C^{*}({\mathcal {R}})\}) \cup \{C\})\). Then, we have \(|\textsf {Cov}(({\mathcal {R}}- \{C^{*}({\mathcal {R}})\}) \cup \{C^{d}_{S}({\mathcal {G}})\})| \le |\textsf {Cov}(({\mathcal {R}}- \{C^{*}({\mathcal {R}})\}) \cup \{C\})| < \left( 1 + \frac{1}{k}\right) |\textsf {Cov}({\mathcal {R}})|\). Thus, the lemma holds. \(\square \)

Fig. 40 Relationships between \(\textsf {Cov}({\mathcal {R}})\), \(C^{*}({\mathcal {R}})\), and C

Proof

(Lemma 5) By the definitions of d-CC and d-core, we have \(C^d(G_j) = C^d_{\{j\}}({\mathcal {G}})\). By Lemma 1, we have \(C^{d}_{L \cup \{j\}}({\mathcal {G}}) \subseteq C_{L}^d({\mathcal {G}}) \cap C^d(G_j)\). Let \(C = C_{L}^d({\mathcal {G}}) \cap C^d(G_j)\). Similar to the proof of Lemma 4, if \(|C| < \frac{1}{k}|\textsf {Cov}({\mathcal {R}})| + |{\varDelta }({\mathcal {R}}, C^*({\mathcal {R}}))|\), we have \(\textsf {Cov}(({\mathcal {R}}- \{C^{*}({\mathcal {R}})\}) \cup \{C^{d}_{L \cup \{j\} }({\mathcal {G}})\}) \subseteq \textsf {Cov}(({\mathcal {R}}- \{C^{*}({\mathcal {R}})\}) \cup \{C\})\). Then, we must have \(|\textsf {Cov}(({\mathcal {R}}- \{C^{*}({\mathcal {R}})\}) \cup \{C^{d}_{L \cup \{j\} }({\mathcal {G}})\})|\le |\textsf {Cov}(({\mathcal {R}}- \{C^{*}({\mathcal {R}})\}) \cup \{C\})|< \left( 1 + \frac{1}{k}\right) |\textsf {Cov}({\mathcal {R}})|\). The lemma thus holds. \(\square \)
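The size bound shared by Lemmas 4 and 5 can be checked numerically on small examples. The sketch below is only an illustration under stated assumptions: \({\mathcal {R}}\) is a list of k vertex sets, \(C^{*}({\mathcal {R}})\) is taken to be the set with the fewest exclusively covered vertices, and Eq. (1) is assumed to be the condition \(|\textsf {Cov}(({\mathcal {R}}- \{C^{*}({\mathcal {R}})\}) \cup \{C\})| \ge (1 + \frac{1}{k})|\textsf {Cov}({\mathcal {R}})|\), as the proofs suggest. The helper names are hypothetical.

```python
def exclusive(R, idx):
    """Vertices covered by R[idx] and by no other set in R, i.e. Delta(R, R[idx])."""
    others = set().union(*(S for j, S in enumerate(R) if j != idx))
    return R[idx] - others

def star_index(R):
    """Index of C*(R): the set with the smallest exclusive coverage (assumed definition)."""
    return min(range(len(R)), key=lambda j: len(exclusive(R, j)))

def size_bound_prunes(R, C):
    """True if |C| < |Cov(R)|/k + |Delta(R, C*(R))|, written with integers only."""
    k = len(R)
    cov = set().union(*R)
    delta = exclusive(R, star_index(R))
    return k * len(C) < len(cov) + k * len(delta)

def satisfies_eq1(R, C):
    """True if replacing C*(R) by C grows the cover by a factor of at least 1 + 1/k."""
    k = len(R)
    cov = set().union(*R)
    star = star_index(R)
    rest = [S for j, S in enumerate(R) if j != star]
    new_cov = set().union(*rest, C)
    return k * len(new_cov) >= (k + 1) * len(cov)
```

Whenever `size_bound_prunes` returns true for a set C, `satisfies_eq1` returns false for C and for every subset of C, which is the reason a candidate and its descendants can be skipped without evaluating the cover.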

Proof

(Lemma 6) Since \(L \subseteq S\), we have \(L \cup \{j\} \subseteq S \cup \{ j\}\). According to Property 3, we have \(C^d_{S \cup \{j\}}({\mathcal {G}}) \subseteq C^d_{L \cup \{j\}}({\mathcal {G}})\). Therefore, \(\textsf {Cov}(({\mathcal {R}}- \{C^{*}({\mathcal {R}})\}) \cup \{C^d_{S \cup \{j\}}({\mathcal {G}})\}) \subseteq \textsf {Cov}(({\mathcal {R}}- \{C^{*}({\mathcal {R}})\}) \cup \{C^d_{L \cup \{j\}}({\mathcal {G}})\})\). Since \(C^d_{L \cup \{j\}}({\mathcal {G}})\) does not satisfy Eq. (1), we must have \(|\textsf {Cov}(({\mathcal {R}}- \{C^{*}({\mathcal {R}})\}) \cup \{C^d_{S \cup \{j\}}({\mathcal {G}})\})| < \left( 1 + \frac{1}{k}\right) |\textsf {Cov}({\mathcal {R}})|\). Thus, the lemma holds. \(\square \)

Proof

(Lemma 7) According to the usage of potential vertex sets, for any descendant \(C^d_{L^{\prime }}({\mathcal {G}})\) of \(C^d_L({\mathcal {G}})\) with \(|L^{\prime }| = s\), we have \(C^d_{L^{\prime }}({\mathcal {G}}) \subseteq U^d_{L}({\mathcal {G}})\). By an argument similar to the proof of Lemma 2, the lemma holds. \(\square \)

Proof

(Lemma 8) Similar to the proof of Lemma 5, we have \(|\textsf {Cov}(({\mathcal {R}}- \{C^{*}({\mathcal {R}})\}) \cup \{U^d_{L - \{ j\}}({\mathcal {G}})\})| < \left( 1 + \frac{1}{k}\right) |\textsf {Cov}({\mathcal {R}})|\). According to the usage of potential sets, for any descendant \(C^d_{L^{\prime }}({\mathcal {G}})\) of \(C^d_L({\mathcal {G}})\) with \(|L^{\prime }| = s\), we have \(C^d_{L^{\prime }}({\mathcal {G}}) \subseteq U^d_{L}({\mathcal {G}})\). Thus, we must have \( |\textsf {Cov}(({\mathcal {R}}- \{C^{*}({\mathcal {R}})\}) \cup \{C^d_{L^{\prime }}({\mathcal {G}})\})| \le |\textsf {Cov}(({\mathcal {R}}- \{C^{*}({\mathcal {R}})\}) \cup \{U^d_{L - \{ j\}}({\mathcal {G}})\})| < \left( 1 + \frac{1}{k}\right) |\textsf {Cov}({\mathcal {R}})|\). Thus, the lemma holds. \(\square \)

Fig. 41 Relationships between \(C^d_{L}({\mathcal {G}})\), \(C^d_{S_1}({\mathcal {G}})\), \(C^d_{S_2}({\mathcal {G}})\), and \(U^d_{L}({\mathcal {G}})\)

Proof

(Lemma 9) We illustrate the relationships between \(C^d_{L}({\mathcal {G}})\), \(C^d_{S_1}({\mathcal {G}})\), \(C^d_{S_2}({\mathcal {G}})\), and \(U^d_{L}({\mathcal {G}})\) in Fig. 41 with five disjoint subsets A, B, C, D, and E. We have \(|C^d_{S_1}({\mathcal {G}})| = |A| + |B| + |C|, |C^d_{S_2}({\mathcal {G}})| = |A| + |C| + |D|, |U^{d}_{L}({\mathcal {G}})| = |A| + |B| + |C| + |D| + |E|, |C^d_{S_1}({\mathcal {G}}) \cap C^d_{S_2}({\mathcal {G}})| = |A|\).

Since \(C^d_{S_1}({\mathcal {G}})\) can update \({\mathcal {R}}\), Lemma 5 implies that \(|C^d_{S_1}({\mathcal {G}})| \ge \frac{1}{k}|\textsf {Cov}({\mathcal {R}})| + |{\varDelta }({\mathcal {R}}, C^*({\mathcal {R}}))|\). Let \({\mathcal {R}}^{\prime }\) be the resulting \({\mathcal {R}}\) after updating \({\mathcal {R}}\) with \(C^d_{S_1}({\mathcal {G}})\). We have \(|\textsf {Cov}({\mathcal {R}}^{\prime })| \ge \left( 1 + \frac{1}{k}\right) |\textsf {Cov}({\mathcal {R}})|\).

Suppose \(C^d_{S_2}({\mathcal {G}})\) can update \({\mathcal {R}}^{\prime }\) again, then we have \(|\textsf {Cov}(({\mathcal {R}}^{\prime } - \{C^{*}({\mathcal {R}}^{\prime })\}) \cup \{C^d_{S_2}({\mathcal {G}})\})| \ge \left( 1 + \frac{1}{k}\right) |\textsf {Cov}({\mathcal {R}}^{\prime })| \ge \left( \frac{1}{k} + \frac{1}{k^2}\right) |\textsf {Cov}({\mathcal {R}})|\). Since \(A \cup C \subset C^d_{S_2}({\mathcal {G}})\), \( \textsf {Cov}(({\mathcal {R}}^{\prime } - \{C^{*}({\mathcal {R}}^{\prime })\}) \cup \{C^d_{S_2}({\mathcal {G}})\}) = \textsf {Cov}({\mathcal {R}}^{\prime }) - {\varDelta }({\mathcal {R}}^{\prime }, C^{*}({\mathcal {R}}^{\prime })) + D \subseteq \textsf {Cov}({\mathcal {R}}^{\prime }) + D\).

Putting them together, we have \(|\textsf {Cov}({\mathcal {R}}^{\prime })| + |D| \ge |\textsf {Cov}(({\mathcal {R}}^{\prime } - \{C^{*}({\mathcal {R}}^{\prime })\}) \cup \{C^d_{S_2}({\mathcal {G}})\})| \ge \left( 1 + \frac{1}{k}\right) |\textsf {Cov}({\mathcal {R}}^{\prime })|\). That means \(|D| \ge \frac{1}{k} |\textsf {Cov}({\mathcal {R}}^{\prime })|\). Thus, for \(U^d_{L}({\mathcal {G}})\), we have

$$\begin{aligned} |U^d_{L}({\mathcal {G}})|&= |A| + |B| + |C| + |D| + |E| \ge |C^d_{S_1}({\mathcal {G}})| + |D| \\&\ge \frac{1}{k}|\textsf {Cov}({\mathcal {R}})| + |{\varDelta }({\mathcal {R}}, C^*({\mathcal {R}}))| + \frac{1}{k} |\textsf {Cov}({\mathcal {R}}^{\prime })| \\&\ge \left( \frac{1}{k} + \frac{1}{k^2}\right) |\textsf {Cov}({\mathcal {R}})| + |{\varDelta }({\mathcal {R}}, C^*({\mathcal {R}}))|\\&\quad + \frac{1}{k}|\textsf {Cov}({\mathcal {R}})|\\&\ge \left( \frac{1}{k} + \frac{1}{k^2}\right) |\textsf {Cov}({\mathcal {R}})|\\&\quad + \left( 1 + \frac{1}{k}\right) |{\varDelta }({\mathcal {R}}, C^*({\mathcal {R}}))|. \end{aligned}$$

The last inequality holds due to the pigeonhole principle: the sets \({\varDelta }({\mathcal {R}}, C^{\prime })\) for \(C^{\prime } \in {\mathcal {R}}\) are disjoint subsets of \(\textsf {Cov}({\mathcal {R}})\), and \(C^{*}({\mathcal {R}})\) has the smallest one, so we must have \(|{\varDelta }({\mathcal {R}}, C^{*}({\mathcal {R}}))| \le \frac{1}{k}|\textsf {Cov}({\mathcal {R}})|\). Now, \(|U^d_{L}({\mathcal {G}})|\) contradicts Eq. (2). Thus, if \(U^d_{L}({\mathcal {G}})\) satisfies Eq. (2), \(C^d_{S_2}({\mathcal {G}})\) cannot update \({\mathcal {R}}\) any more. \(\square \)

Proof

(Lemma 10) We prove that if there does not exist a candidate path in the index to w, then w is certainly not in \(C^{d}_{L}({\mathcal {G}})\).

First, for each vertex v in the lowest level, if \(L \not \subseteq L(v)\), there must exist a layer number \(j \in L\) such that \(v \notin C^{d}(G_j)\). By Lemma 1, we must have \(v \not \in C^{d}_{L}({\mathcal {G}})\). Thus, we can remove all such vertices from the graph and the index. After that, we consider each vertex w in the next level of the lowest level. At this time, all of w’s neighbors u in the lowest level such that \(L \not \subseteq L(u)\) have already been removed from the graph. Thus, vertex w has the same neighbors as we build the index. If \(L \not \subseteq L(w)\), there must exist a layer number \(j^{\prime } \in L\) such that \(w \notin C^{d}(G_{j^{\prime }})\). By Lemma 1, w cannot be contained in \(C^{d}_{L}({\mathcal {G}})\). We can continue this process level by level. This implies that all the vertices that do not satisfy this condition cannot exist in \(C^{d}_{L}({\mathcal {G}})\). \(\square \)

Proof

(Lemma 11) We have \(C^{d}_{L}({\mathcal {G}}) \subseteq Y\) by the definition of Y. For each vertex v, if \(v \in \bigcup _{h = 0}^{|L| - 1} I_{h}\), the support of v is less than |L|. Thus, v cannot be contained in a d-CC on at least |L| layers. Therefore, each vertex \(v \in C^{d}_{L}({\mathcal {G}})\) must satisfy \(v \in \bigcup _{h = |L|}^{l({\mathcal {G}})} I_{h}\). Thus, the lemma holds. \(\square \)

Proof

(Lemma 12) We prove this lemma by contradiction. Suppose there exists a vertex \(v \not \in C^{d}_{L}({\mathcal {G}})\) and v is not set to discarded after the FastdCC procedure. We must have \(d^{+}_{i}(v) \ge d\) for each \(i \in L\). Otherwise, v must have been set to discarded at line 2 of the ProcessUnd procedure or line 5 of the ProcessDis procedure. At this time, since v only connects to undetermined or existing vertices, we have \(d_{G_i}(v) = d^{+}_{i}(v) \ge d\) for each \(i \in L\). Therefore, we have \(v \in C^{d}_{L}({\mathcal {G}})\), which leads to a contradiction. Thus, the lemma holds. \(\square \)

Proof

(Lemma 13) To analyze the time complexity of FastdCC procedure, we first analyze the cases when an edge can be accessed as follows:

  1. At line 4 of the FastdCC procedure, when computing \(d^{+}_{i}(v)\) and \(d^{*}_{i}(v)\) of all \(i \in L\) for each vertex v, each edge (u, v) on a layer \(i \in L\) will be accessed exactly once.

  2. At line 8 of the ProcessUnd procedure, when vertex u accesses a vertex v on a higher level, each edge (u, v) on a layer \(i \in L^{\prime }\) will be accessed exactly once.

  3. At line 3 of the ProcessDis procedure, when updating \(d^{+}_{i}(u)\), the edge (u, v) on a layer \(i \in L\) will be accessed. Note that the edge (u, v) on a layer \(i \in L\) will be accessed only once. This is because, when updating \(d^{+}_{i}(u)\), v has already been set to discarded. The state of a discarded vertex never changes. Thus, v will never have a chance to visit u again. Meanwhile, since v is discarded, u will not visit vertex v afterward either.

  4. At line 3 of the ProcessEst procedure, when updating \(d^{*}_{i}(u)\), the edge (u, v) on a layer \(i \in L\) will be accessed. Similarly, at this time, v has already been set to existing. The state of an existing vertex also never changes. Thus, the edge (u, v) on a layer \(i \in L\) will also be accessed only once.

Thus, the total edge access time is \(O(4\sum _{i \in L^{\prime }} |E_{i}({\mathcal {G}})|) = O(4m^{\prime })\).

Next, we analyze the time cost on comparing \(d^{+}_{i}(v)\) and \(d^{*}_{i}(v)\) on each vertex v w.r.t. d as follows.

  1. At line 1 and line 3 of the ProcessUnd procedure, we compare \(d^{+}_{i}(v)\) and \(d^{*}_{i}(v)\) w.r.t. d for each vertex v which is set to undetermined for the first time. The time cost for vertex v is O(2l). Thus, the total time cost for all vertices is O(2nl).

  2. At line 4 of the ProcessDis procedure, which compares \(d^{+}_{i}(v)\) w.r.t. d, the comparison can be folded into the update of \(d^{+}_{i}(v)\) at line 3, so the number of comparisons equals the number of edge accesses. As we analyzed earlier, each edge on each layer is accessed at most once, so the total time cost is \(O(m^{\prime })\).

  3. At line 4 of the ProcessEst procedure, which compares \(d^{*}_{i}(v)\) w.r.t. d, the comparison can also be folded into the update of \(d^{*}_{i}(v)\) at line 3. Similarly, the number of comparisons equals the number of edge accesses. The total time cost is \(O(m^{\prime })\).

Meanwhile, the time cost to set the states of each vertex is at most O(4n) since there are only four states in the procedure. Putting them together, the time complexity of the FastdCC procedure is \(O(2nl + 6m^{\prime } + 4n) = O(nl + m^{\prime })\). \(\square \)

Proof

(Theorem 1) Given a collection of sets \({\mathcal {F}}= \{C_1, C_2, \dots , C_n\}\) and \(k \in {\mathbb {N}}\), the max-k-cover problem is to find a subset \({\mathcal {R}}\subseteq {\mathcal {F}}\) such that \(|{\mathcal {R}}| = k\) and that \(|\textsf {Cov}({\mathcal {R}})|\) is maximized. The max-k-cover problem has been proved to be NP-hard [2].

It is easy to show that the DCCS problem is in NP. We prove the theorem by reduction from the max-k-cover problem in polynomial time. Given an instance \(({\mathcal {F}}, k)\) of the max-k-cover problem, we first construct a multi-layer graph \({\mathcal {G}}\). The vertex set of \({\mathcal {G}}\) is \(\bigcup _{i = 1}^n C_i\). There are n layers in \({\mathcal {G}}\). An edge (u, v) exists on layer i if and only if \(u, v \in C_i\) and \(u \ne v\). Then, we construct an instance of the DCCS problem \(({\mathcal {G}}, d, s, k)\), where \(d = 1\) and \(s = 1\). The result of the DCCS problem instance \(({\mathcal {G}}, d, s, k)\) is exactly the result of the max-k-cover problem instance \(({\mathcal {F}}, k)\). The reduction can be done in polynomial time. Thus, the DCCS problem is NP-complete. \(\square \)
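The construction used in this reduction can be written down directly. The following is a minimal sketch in which the function name `max_k_cover_to_dccs` and the edge-set representation of layers are assumptions made for illustration: each set \(C_i\) becomes a clique on layer i, and the DCCS instance uses \(d = 1\) and \(s = 1\).

```python
from itertools import combinations

def max_k_cover_to_dccs(family, k):
    """Build the DCCS instance (vertices, layers, d, s, k) from a max-k-cover instance.

    family is a list of sets C_1..C_n; layers[i] is the edge set of layer i,
    with each undirected edge stored as a frozenset {u, v}.
    """
    vertices = set().union(*family)
    # An edge {u, v} exists on layer i iff u and v both belong to C_i and u != v.
    layers = [{frozenset(pair) for pair in combinations(C, 2)} for C in family]
    d, s = 1, 1
    return vertices, layers, d, s, k
```

In this sketch, the vertices with at least one incident edge on layer i are exactly the members of \(C_i\) whenever \(C_i\) has two or more elements; a set with fewer than two elements produces no edges.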

To prove Theorems 3 and 4, we first state the following claim. The correctness of the claim has been proved in [2].

Claim

Let \({\mathcal {F}}= \{C_1, C_2, \dots , C_n\}\) and \(k \in {\mathbb {N}}\). Let \({\mathcal {R}}^{*}\) be the subset of \({\mathcal {F}}\) such that \(|{\mathcal {R}}^*| = k\) and \(|\textsf {Cov}({\mathcal {R}}^*)|\) is maximized. Let \({\mathcal {R}}\subseteq {\mathcal {F}}\) be a set obtained in the following way. Initially, \({\mathcal {R}}= \emptyset \). We repeatedly take an element C out of \({\mathcal {F}}\) at random and update \({\mathcal {R}}\) with C according to the two rules specified in Sect. 5.1 until \({\mathcal {F}}= \emptyset \). Finally, we have \(|\textsf {Cov}({\mathcal {R}})| \ge \frac{1}{4} |\textsf {Cov}({\mathcal {R}}^{*})|\).

Proof

(Theorem 3) Note that the BU-DCCS algorithm uses the same procedure described in Claim A to update \({\mathcal {R}}\) except that some pruning techniques are applied as well. Therefore, we only need to show that the pruning techniques will not affect the approximation ratio stated in Claim A. Let C be a d-CC pruned by a pruning method and \(D_C\) be the set of descendant candidate d-CCs of C in the search tree. For all \(C^{\prime } \in D_C\), according to Lemmas 2, 4, or 5, \(C^{\prime }\) must not update \({\mathcal {R}}\). By Claim A, candidate d-CCs can be taken in an arbitrary order without affecting the approximation ratio. Therefore, we can safely ignore all the d-CCs in \(D_C\) without affecting the quality of \({\mathcal {R}}\). Finally, we have \(|\textsf {Cov}({\mathcal {R}})| \ge \frac{1}{4} |\textsf {Cov}({\mathcal {R}}^*)|\). Thus, the theorem holds. \(\square \)

Proof

(Theorem 4) The TD-DCCS algorithm uses the same procedure described in Claim A to update \({\mathcal {R}}\) and applies some pruning techniques in addition. By the same arguments as in the proof of Theorem 3, this theorem holds. \(\square \)

The Update procedure

We present the Update procedure in Fig. 42. The input includes the set \({\mathcal {R}}\) of temporary top-k diversified d-CCs, a newly generated d-CC C, and \(k \in {\mathbb {N}}\). For each d-CC \(C^{\prime } \in {\mathcal {R}}\), we store both \(C^{\prime }\) and the size \(|{\varDelta }({\mathcal {R}}, C^{\prime })| \). To facilitate fast updating of \({\mathcal {R}}\), we build some auxiliary data structures. Specifically, we store \({\mathcal {R}}\) in two hash tables M and H. For each entry in M, the key of the entry is a vertex v, and the value of the entry is \(M[v] = \{ C^{\prime } | C^{\prime } \in {\mathcal {R}}, v \in C^{\prime } \}\), that is, the set of d-CCs \(C^{\prime } \in {\mathcal {R}}\) containing vertex v. For each entry in H, the key of the entry is an integer i, and the value of the entry H[i] is the set of d-CCs \(C^{\prime } \in {\mathcal {R}}\) such that \(|{\varDelta }({\mathcal {R}}, C^{\prime })| = i\). Thus, \(C^{*}({\mathcal {R}})\) can be easily obtained from H by retrieving the entry of H indexed by the smallest key.

Given the temporary result set \({\mathcal {R}}\) and a new d-CC C, the procedure relies on three key operations to update \({\mathcal {R}}\), namely Size(\({\mathcal {R}}\), C) that returns the size \(|\textsf {Cov}( ({\mathcal {R}}- \{C^{*}({\mathcal {R}}) \} ) \cup \{ C \} )|\), Delete(\({\mathcal {R}}\)) that removes \(C^{*}({\mathcal {R}})\) from \({\mathcal {R}}\), and Insert(\({\mathcal {R}}\), C) that inserts C to \({\mathcal {R}}\). We describe these procedures as follows.

Fig. 42 The Update, Size, Delete, and Insert procedures

Operation \({\mathsf {Size}({\mathcal {R}}, C)}\) Note that \(\textsf {Cov}( ({\mathcal {R}}- \{C^{*}({\mathcal {R}}) \} ) \cup \{ C \} )\) can be decomposed into three disjoint subsets \(\textsf {Cov}({\mathcal {R}}- \{ C^{*}({\mathcal {R}}) \} )\), \(C - \textsf {Cov}({\mathcal {R}})\) and \(C \cap {\varDelta }({\mathcal {R}}, C^{*}({\mathcal {R}}))\). In the beginning, we can obtain \(C^{*}({\mathcal {R}})\) and \(|{\varDelta }({\mathcal {R}}, C^{*}({\mathcal {R}}) )|\) from H (line 1) and initialize the counter c to 0 (line 1). For each vertex \(v \in C\), if v is not a key in M, we have \(v \in C - \textsf {Cov}({\mathcal {R}})\), so we increase c by 1 (line 5). Otherwise, if \(v \in C^{*}({\mathcal {R}})\) and M[v] only contains \(C^{*}({\mathcal {R}})\), c is also increased by 1 (line 7) since \(v \in C \cap {\varDelta }({\mathcal {R}}, C^{*}({\mathcal {R}}))\). Since \(|\textsf {Cov}({\mathcal {R}}- \{ C^{*}({\mathcal {R}}) \} )|\) is equal to \(\textsf {size}(M) - |{\varDelta }({\mathcal {R}}, C^{*}({\mathcal {R}}))|\), we accumulate \(\textsf {size}(M) - |{\varDelta }({\mathcal {R}}, C^{*}({\mathcal {R}}))|\) to c (line 8) and return c as the result (line 9).

Operation \({\mathsf {Delete}({\mathcal {R}})}\) First, we retrieve \(C^{*}({\mathcal {R}})\) from H (line 1). For each vertex \(v \in C^{*}({\mathcal {R}})\), \(C^{*}({\mathcal {R}})\) is removed from M[v] (line 3). Note that, if M[v] contains a single element \(C^{\prime }\) after removing \(C^{*}({\mathcal {R}})\), v is a vertex only covered by \(C^{\prime }\). Therefore, we move \(C^{\prime }\) from \(H[|{\varDelta }({\mathcal {R}}, C^{\prime })|]\) to \(H[|{\varDelta }({\mathcal {R}}, C^{\prime })| + 1]\) (line 6) and increase \(|{\varDelta }({\mathcal {R}}, C^{\prime })|\) by 1 (line 7). If M[v] is empty, v is not covered by \({\mathcal {R}}\), so v is removed from M (line 9).

Operation \({\mathsf {Insert}({\mathcal {R}}, C)}\) First, we insert C into \({\mathcal {R}}\) (line 1) and then set \(|{\varDelta }({\mathcal {R}}, C)|\) to 0 (line 2). For each vertex \(v \in C\), if v is not a key in M, we insert an entry with key v and value C into hash table M (lines 5–6). At this moment, v is only covered by C, so \(|{\varDelta }({\mathcal {R}}, C)|\) is increased by 1 (line 7). If v is a key in M, C can be directly inserted into M[v] (line 12). Note that, if M[v] contains a single element \(C^{\prime }\) before insertion, v will no longer be covered only by \(C^{\prime }\) after inserting C, so \(C^{\prime }\) is moved in H from \(H[|{\varDelta }({\mathcal {R}}, C^{\prime })|]\) to \(H[|{\varDelta }({\mathcal {R}}, C^{\prime })| - 1]\) (line 11), and \(|{\varDelta }({\mathcal {R}}, C^{\prime })|\) is decreased by 1 (line 12). After updating M, we obtain \(|{\varDelta }({\mathcal {R}}, C)|\) and insert C into H accordingly (line 14).

Putting them all together, we have the Update procedure. If \(|{\mathcal {R}}| < k\), we directly insert C into \({\mathcal {R}}\) (line 2). If \(|{\mathcal {R}}| \ge k\), the Size(\({\mathcal {R}}\), C) procedure is invoked to check if C satisfies Rule 2 (line 5). If so, \({\mathcal {R}}\) is updated with C by invoking Delete(\({\mathcal {R}}\)) and Insert(\({\mathcal {R}}\), C) (lines 6–7).
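The four operations described above can be sketched in Python as follows. This is an illustrative reading of the prose, not the pseudocode of Fig. 42: the class name `TopKCover`, the integer set ids, and the helper methods are assumptions, and Rule 2 is assumed to be the test \(|\textsf {Cov}(({\mathcal {R}}- \{C^{*}({\mathcal {R}})\}) \cup \{C\})| \ge (1 + \frac{1}{k})|\textsf {Cov}({\mathcal {R}})|\) used in the proofs.

```python
class TopKCover:
    """Temporary result set R backed by the two hash tables M and H."""

    def __init__(self, k):
        self.k = k
        self.sets = {}  # set id -> frozenset of vertices (the d-CCs in R)
        self.excl = {}  # set id -> |Delta(R, C')|, vertices covered only by C'
        self.M = {}     # vertex -> ids of the sets in R that contain it
        self.H = {}     # exclusive size -> ids of the sets with that size
        self._next_id = 0

    def _unbucket(self, cid):
        # Remove set `cid` from its bucket in H and drop the bucket if it is empty.
        bucket = self.H[self.excl[cid]]
        bucket.discard(cid)
        if not bucket:
            del self.H[self.excl[cid]]

    def _shift(self, cid, step):
        # Move set `cid` to another bucket of H when its exclusive size changes.
        self._unbucket(cid)
        self.excl[cid] += step
        self.H.setdefault(self.excl[cid], set()).add(cid)

    def c_star(self):
        # min() over the keys of H costs O(k) in this sketch.
        return next(iter(self.H[min(self.H)]))

    def size(self, C):
        # |Cov((R - {C*}) + {C})| via the three disjoint parts described in the text.
        star, c = self.c_star(), 0
        for v in C:
            if v not in self.M or self.M[v] == {star}:
                c += 1
        return c + len(self.M) - self.excl[star]

    def delete(self):
        star = self.c_star()
        for v in self.sets[star]:
            self.M[v].discard(star)
            if len(self.M[v]) == 1:
                (other,) = self.M[v]
                self._shift(other, +1)  # v is now covered only by `other`
            elif not self.M[v]:
                del self.M[v]           # v is no longer covered by R
        self._unbucket(star)
        del self.sets[star], self.excl[star]

    def insert(self, C):
        cid, self._next_id = self._next_id, self._next_id + 1
        self.sets[cid] = frozenset(C)
        own = 0
        for v in self.sets[cid]:
            covering = self.M.setdefault(v, set())
            if not covering:
                own += 1                # v is covered only by the new set
            elif len(covering) == 1:
                (other,) = covering
                self._shift(other, -1)  # v stops being exclusive to `other`
            covering.add(cid)
        self.excl[cid] = own
        self.H.setdefault(own, set()).add(cid)

    def update(self, C):
        if len(self.sets) < self.k:
            self.insert(C)
            return True
        # Integer form of the assumed Rule 2 test: size >= (1 + 1/k) * |Cov(R)|.
        if self.k * self.size(C) >= (self.k + 1) * len(self.M):
            self.delete()
            self.insert(C)
            return True
        return False
```

The number of keys in `M` is always \(|\textsf {Cov}({\mathcal {R}})|\), so the acceptance test needs no separate cover computation. Keeping the keys of `H` in a sorted structure would avoid the linear scan in `c_star`.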

Complexity analysis First we analyze the time complexity of the Update procedure. Assume that an entry can be inserted to or deleted from a hash table in constant time. Thus, the time complexity of Size(\({\mathcal {R}}\), C), Delete(\({\mathcal {R}}\)), and Insert(\({\mathcal {R}}\), C) is O(|C|), \(O(|C^{*}({\mathcal {R}})|)\) and O(|C|), respectively. Consequently, the time complexity of Update is obviously \(O(\max \{|C|, |C^{*}({\mathcal {R}})|\})\).

The space cost for storing the result set \({\mathcal {R}}\) and maintaining the hash table M is \(O(\sum _{C^{\prime } \in {\mathcal {R}}} |C^{\prime }|)\), and the space cost for storing \(|{\varDelta }({\mathcal {R}}, C^{\prime })|\) and maintaining the hash table H is O(k). Thus, the space complexity of Update is \(O(2\sum _{C^{\prime } \in {\mathcal {R}}} |C^{\prime }| + 2k) = O(\sum _{C^{\prime } \in {\mathcal {R}}} |C^{\prime }|)\).

Cite this article

Zhu, R., Zou, Z. & Li, J. Fast diversified coherent core search on multi-layer graphs. The VLDB Journal 28, 597–622 (2019). https://doi.org/10.1007/s00778-019-00542-3
