Parallel mining of large maximal quasi-cliques

  • Regular Paper
The VLDB Journal

Abstract

Given a user-specified minimum degree threshold \(\gamma\), a \(\gamma\)-quasi-clique is a subgraph in which each vertex is adjacent to at least a \(\gamma\) fraction of the other vertices. The quasi-clique is a natural definition of a dense structure, so finding large, and hence statistically significant, quasi-cliques is useful in applications such as community detection in social networks and the discovery of significant biomolecule structures and pathways. However, mining maximal quasi-cliques is notoriously expensive, and even a recent algorithm for mining large maximal quasi-cliques is flawed and can lead to many repeated searches. This paper proposes a parallel solution for mining maximal quasi-cliques that is able to fully utilize CPU cores. Our solution uses divide and conquer to decompose the workload into independent tasks for parallel mining. We address two problems, (i) drastic load imbalance among tasks and (ii) the difficulty of predicting a task's running time and how that time grows with task-subgraph size, by (a) using a timeout-based task decomposition strategy and (b) using a priority task queue that schedules long-running tasks earlier for mining and decomposition, so as to avoid stragglers. Unlike our conference version in PVLDB 2020, where the solution was built on a distributed graph mining framework called G-thinker, this paper targets a single-machine multi-core environment, which is more accessible to an average end user. A general framework called T-thinker is developed to facilitate the writing of parallel programs for algorithms that adopt divide and conquer, including but not limited to our quasi-clique mining algorithm. Additionally, we consider the problem of directly mining large quasi-cliques from dense parts of a graph, where we identify the repeated-search issue of a recent method and address it using a carefully designed concurrent trie data structure.
Extensive experiments verify that our parallel solution scales well with the number of CPU cores, achieving a 26.68\(\times\) runtime speedup with 32 mining threads when mining a graph with 3.77M vertices and 16.5M edges. Additionally, mining large quasi-cliques from dense parts can provide an additional speedup of up to 89.46\(\times\).
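
As a small illustration of the definition that opens the abstract, the following sketch checks whether a given vertex set induces a \(\gamma\)-quasi-clique. It is only a direct reading of the definition, not the paper's mining algorithm. The function name `is_quasi_clique`, the adjacency-set graph representation, and the example graph are assumptions made for illustration, and the check does not test connectivity or maximality.

```python
from fractions import Fraction


def is_quasi_clique(adj, vertices, gamma):
    """Return True if `vertices` induces a gamma-quasi-clique in `adj`.

    adj      : dict mapping each vertex to the set of its neighbours
               (undirected graph, no self-loops) -- hypothetical layout
    vertices : iterable of vertices forming the candidate subgraph
    gamma    : minimum degree threshold in [0, 1]
    """
    s = set(vertices)
    if not s:
        return False
    # Exact rational arithmetic avoids float artefacts such as
    # 0.7 * 10 evaluating to slightly more than 7.
    threshold = Fraction(str(gamma)) * (len(s) - 1)
    # Every vertex must be adjacent to at least a gamma fraction
    # of the other |S| - 1 vertices inside the subgraph.
    return all(len(adj[v] & s) >= threshold for v in s)


# Hypothetical example: a 4-cycle 0-1-2-3-0 with one chord 0-2.
example_adj = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1, 3},
    3: {0, 2},
}

print(is_quasi_clique(example_adj, [0, 1, 2, 3], 0.6))  # each vertex needs >= 1.8 neighbours
print(is_quasi_clique(example_adj, [0, 1, 2, 3], 0.7))  # each vertex needs >= 2.1 neighbours
print(is_quasi_clique(example_adj, [0, 1, 2], 1))       # a triangle is a clique
```

With \(\gamma = 0.6\) vertices 1 and 3 (two neighbours each out of three others) satisfy the threshold, while with \(\gamma = 0.7\) they do not, which shows how sensitive membership is to the threshold.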

Notes

  1. https://github.com/beginner1010/topk-quasi-clique-enumeration

References

  1. Abello, J., Resende, M.G.C., Sudarsky, S.: Massive quasi-clique detection. In: LATIN, volume 2286 of Lecture Notes in Computer Science, pp. 598–612. Springer (2002)

  2. Bader, G.D., Hogue, C.W.: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinform. 4(1), 2 (2003)

  3. Batagelj, V., Zaversnik, M.: An O(m) algorithm for cores decomposition of networks. CoRR, cs.DS/0310049 (2003)

  4. Bayardo Jr, R.J.: Efficiently mining long patterns from databases. In: SIGMOD Conference, pp. 85–93. ACM Press (1998)

  5. Berlowitz, D., Cohen, S., Kimelfeld, B.: Efficient enumeration of maximal k-plexes. In: SIGMOD Conference, pp. 431–444. ACM (2015)

  6. Bhattacharyya, M., Bandyopadhyay, S.: Mining the largest quasi-clique in human protein interactome. In: 2009 International Conference on Adaptive and Intelligent Systems, pp. 194–199. IEEE (2009)

  7. Brunato, M., Hoos, H.H., Battiti, R.: On effectively finding maximal quasi-cliques in graphs. In: International Conference on Learning and Intelligent Optimization, pp. 41–55. Springer, Berlin (2007)

  8. Bu, D., Zhao, Y., Cai, L., Xue, H., Zhu, X., Lu, H., Zhang, J., Sun, S., Ling, L., Zhang, N., et al.: Topological structure analysis of the protein-protein interaction network in budding yeast. Nucleic Acids Res. 31(9), 2443–2450 (2003)

  9. COST in the Land of Databases. https://github.com/frankmcsherry/blog/blob/master/posts/2017-09-23.md

  10. Chang, L., Yu, J.X., Qin, L., Lin, X., Liu, C., Liang, W.: Efficiently computing k-edge connected components via graph decomposition. In: SIGMOD Conference, pp. 205–216. ACM (2013)

  11. Chen, H., Liu, M., Zhao, Y., Yan, X., Yan, D., Cheng, J.: G-miner: an efficient task-oriented graph mining system. In: EuroSys, pp. 32:1–32:12. ACM (2018)

  12. Chou, Y.H., Wang, E.T., Chen, A.L.P.: Finding maximal quasi-cliques containing a target vertex in a graph. In: DATA, pp. 5–15. SciTePress (2015)

  13. Chu, S., Cheng, J.: Triangle listing in massive networks. TKDD 6(4), 17:1–17:32 (2012)

  14. Conde-Cespedes, P., Ngonmang, B., Viennet, E.: An efficient method for mining the maximal \(\alpha \)-quasi-clique-community of a given node in complex networks. Soc. Netw. Anal. Min. 8(1), 20 (2018)

  15. Conte, A., Firmani, D., Mordente, C., Patrignani, M., Torlone, R.: Fast enumeration of large k-plexes. In: SIGKDD, pp. 115–124. ACM (2017)

  16. Conte, A., Matteis, T.D., Sensi, D.D., Grossi, R., Marino, A., Versari, L.: D2K: scalable community detection in massive networks via small-diameter k-plexes. In: SIGKDD, pp. 1272–1281. ACM (2018)

  17. Cui, W., Xiao, Y., Wang, H., Lu, Y., Wang, W.: Online search of overlapping communities. In: SIGMOD Conference, pp. 277–288. ACM (2013)

  18. Fan, W., Jin, R., Liu, M., Lu, P., Luo, X., Xu, R., Yin, Q., Yu, W., Zhou, J.: Application driven graph partitioning. In: SIGMOD Conference, pp. 1765–1779. ACM (2020)

  19. Guo, G., Yan, D., Özsu, M.T., Jiang, Z., Khalil, J.: Scalable mining of maximal quasi-cliques: an algorithm-system codesign approach. Proc. VLDB Endow. 14(4), 573–585 (2020)

  20. Guo, G., Yan, D., Özsu, M.T., Jiang, Z., Khalil, J.: Scalable mining of maximal quasi-cliques: an algorithm-system codesign approach. CoRR, arXiv:2005.00081 (2020)

  21. Hopcroft, J., Khan, O., Kulis, B., Selman, B.: Tracking evolving communities in large linked networks. Proc. Natl. Acad. Sci. 101(suppl 1), 5249–5253 (2004)

  22. Jiang, D., Pei, J.: Mining frequent cross-graph quasi-cliques. ACM Trans. Knowl. Discov. Data 2(4), 16:1–16:42 (2009)

  23. Joshi, A., Zhang, Y., Bogdanov, P., Hwang, J.: An efficient system for subgraph discovery. In: IEEE Big Data, pp. 703–712 (2018)

  24. Lee, P., Lakshmanan, L.V.S.: Query-driven maximum quasi-clique search. In: SDM, pp. 522–530. SIAM (2016)

  25. Li, J., Wang, X., Cui, Y.: Uncovering the overlapping community structure of complex networks by maximal cliques. Physica A Stat. Mech. Appl. 415, 398–406 (2014)

  26. Liu, G., Wong, L.: Effective pruning techniques for mining quasi-cliques. In: ECML/PKDD, volume 5212 of Lecture Notes in Computer Science, pp. 33–49. Springer, Berlin (2008)

  27. Lu, C., Yu, J.X., Wei, H., Zhang, Y.: Finding the maximum clique in massive graphs. Proc. VLDB Endow. 10(11), 1538–1549 (2017)

  28. Lyu, B., Qin, L., Lin, X., Zhang, Y., Qian, Z., Zhou, J.: Maximum biclique search at billion scale. Proc. VLDB Endow. 13(9), 1359–1372 (2020)

  29. Malewicz, G., Austern, M.H., Bik, A.J.C., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: SIGMOD Conference, pp. 135–146 (2010)

  30. Matsuda, H., Ishihara, T., Hashimoto, A.: Classifying molecular sequences using a linkage graph with their pairwise similarities. Theor. Comput. Sci. 210(2), 305–325 (1999)

  31. McSherry, F., Isard, M., Murray, D.G.: Scalability! but at what cost? In: HotOS (2015)

  32. Pattillo, J., Veremyev, A., Butenko, S., Boginski, V.: On the maximum quasi-clique problem. Discrete Appl. Math. 161(1–2), 244–257 (2013)

  33. Pei, J., Jiang, D., Zhang, A.: On mining cross-graph quasi-cliques. In: SIGKDD, pp. 228–238. ACM (2005)

  34. Qin, L., Yu, J.X., Chang, L., Cheng, H., Zhang, C., Lin, X.: Scalable big graph processing in mapreduce. In: SIGMOD Conference, pp. 827–838. ACM (2014)

  35. Quamar, A., Deshpande, A., Lin, J.: Nscale: neighborhood-centric large-scale graph analytics in the cloud. VLDB J. 1–26 (2014)

  36. Sanei-Mehri, S., Das, A., Tirthapura, S.: Enumerating top-k quasi-cliques. In: IEEE BigData, pp. 1107–1112. IEEE (2018)

  37. Tanner, B.K., Warner, G., Stern, H., Olechowski, S.: Koobface: The evolution of the social botnet. In: eCrime, pp. 1–10. IEEE (2010)

  38. Teixeira, C.H.C., Fonseca, A.J., Serafini, M., Siganos, G., Zaki, M.J., Aboulnaga, A.: Arabesque: a system for distributed graph mining. In: SOSP, pp. 425–440 (2015)

  39. Wang, K., Zuo, Z., Thorpe, J., Nguyen, T.Q., Xu, G.H.: Rstream: marrying relational algebra with streaming for efficient graph mining on a single machine. In: OSDI, pp. 763–782 (2018)

  40. Weiss, D., Warner, G.: Tracking criminals on facebook: a case study from a digital forensics reu program. In: Proceedings of Annual ADFSL Conference on Digital Forensics, Security and Law (2015)

  41. Yan, D., Bu, Y., Tian, Y., Deshpande, A.: Big graph analytics platforms. Found. Trends Databases 7(1–2), 1–195 (2017)

  42. Yan, D., Bu, Y., Tian, Y., Deshpande, A., Cheng, J.: Big graph analytics systems. In: SIGMOD Conference, pp. 2241–2243. ACM (2016)

  43. Yan, D., Cheng, J., Chen, H., Long, C., Bangalore, P.: Lightweight fault tolerance in pregel-like systems. In: ICPP, pp. 69:1–69:10. ACM (2019)

  44. Yan, D., Cheng, J., Lu, Y., Ng, W.: Blogel: a block-centric framework for distributed computation on real-world graphs. Proc. VLDB Endow. 7(14), 1981–1992 (2014)

  45. Yan, D., Cheng, J., Lu, Y., Ng, W.: Effective techniques for message reduction and load balancing in distributed graph computation. In: WWW, pp. 1307–1317 (2015)

  46. Yan, D., Cheng, J., Özsu, M.T., Yang, F., Lu, Y., Lui, J.C.S., Zhang, Q., Ng, W.: A general-purpose query-centric framework for querying big graphs. Proc. VLDB Endow. 9(7), 564–575 (2016)

  47. Yan, D., Cheng, J., Xing, K., Lu, Y., Ng, W., Bu, Y.: Pregel algorithms for graph connectivity problems with performance guarantees. PVLDB 7(14), 1821–1832 (2014)

  48. Yan, D., Guo, G.: Systems and algorithms for massively parallel graph mining. In: BigData. IEEE (2020)

  49. Yan, D., Guo, G., Chowdhury, M.M.R., Özsu, M.T., Ku, W., Lui, J.C.S.: G-thinker: a distributed framework for mining subgraphs in a big graph. In: ICDE, pp. 1369–1380. IEEE (2020)

  50. Yan, D., Guo, G., Khalil, J. et al. G-thinker: a general distributed framework for finding qualified subgraphs in a big graph with load balancing. The VLDB Journal (2021). https://doi.org/10.1007/s00778-021-00688-z

  51. Yan, D., Guo, G., Chowdhury, M.M.R., Özsu, M.T., Lui, J.C.S., Tan, W.: T-thinker: a task-centric distributed framework for compute-intensive divide-and-conquer algorithms. In: PPoPP, pp. 411–412. ACM (2019)

  52. Yan, D., Huang, Y., Liu, M., Chen, H., Cheng, J., Wu, H., Zhang, C.: Graphd: Distributed vertex-centric graph processing beyond the memory limit. IEEE Trans. Parallel Distrib. Syst. 29(1), 99–114 (2018)

  53. Yan, D., Liu, H.: Parallel graph processing. In: Encyclopedia of Big Data Technologies. Springer (2019)

  54. Yan, D., Qu, W., Guo, G., Wang, X.: Prefixfpm: A parallel framework for general-purpose frequent pattern mining. In: ICDE, pp. 1938–1941. IEEE (2020)

  55. Yan, D., Qu, W., Guo, G. et al.: PrefixFPM: a parallel framework for general-purpose mining of frequent and closed patterns. The VLDB Journal (2021). https://doi.org/10.1007/s00778-021-00687-0

  56. Yan, D., Tian, Y., Cheng, J.: Systems for Big Graph Analytics. Springer Briefs in Computer Science. Springer (2017)

  57. Yang, Y., Yan, D., Wu, H., Cheng, J., Zhou, S., Lui, J.C.S.: Diversified temporal subgraph pattern mining. In: SIGKDD, pp. 1965–1974. ACM (2016)

  58. Zeng, Z., Wang, J., Zhou, L., Karypis, G.: Coherent closed quasi-clique discovery from large dense graph databases. In: SIGKDD, pp. 797–802. ACM (2006)

  59. Zhang, Q., Yan, D., Cheng, J.: Quegel: A general-purpose system for querying big graphs. In: SIGMOD Conference, pp. 2189–2192. ACM (2016)

  60. Zhou, Y., Xu, J., Guo, Z., Xiao, M., Jin, Y.: Enumerating maximal k-plexes with worst-case time guarantee. In: AAAI, pp. 2442–2449. AAAI Press (2020)

Acknowledgements

This work was supported by NSF OAC-1755464 (CRII) and DGE-1723250 (SaTC). Guimu Guo acknowledges financial support from the Alabama Graduate Research Scholars Program (GRSP) funded through the Alabama Commission for Higher Education and administered by the Alabama EPSCoR.

Author information

Corresponding author

Correspondence to Da Yan.

About this article

Cite this article

Khalil, J., Yan, D., Guo, G. et al. Parallel mining of large maximal quasi-cliques. The VLDB Journal 31, 649–674 (2022). https://doi.org/10.1007/s00778-021-00712-2

  • DOI: https://doi.org/10.1007/s00778-021-00712-2
