SPAC: Scalable Pattern Approximate Counting in Graph Mining

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2022)

Abstract

Pattern counting is a crucial task in graph pattern mining. Exact counting becomes unaffordable as datasets grow, so approximate counting is increasingly used to provide an estimated answer quickly. However, current approximate counting approaches are still time-consuming and do not scale to extra-large graphs. This paper proposes SPAC, a fast and flexible approximate pattern counting method, based on the observation that the distribution of pattern counts over vertex degrees follows a power law, just as the vertex degree distribution does in most graph datasets. By leveraging this distribution, SPAC can efficiently choose a small number of degrees as samples, fit the coefficients, and then calculate the pattern frequency directly. To provide flexibility for different use cases, SPAC supports both exact and approximate counting in the sampling phase. Moreover, edge weighting and interpolation techniques are adopted to emphasize the sample tail and improve fitting accuracy. The prototype of SPAC is implemented with GraphX on Spark and is evaluated on various well-known graphs. The experimental results show that SPAC is up to 10x faster than exact counting while keeping the error below 10%. Compared to existing approximate counting systems, SPAC is generally 1.4x–9x faster, while the error can be reduced to 20% of that of current systems.
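The sample-fit-calculate idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (the prototype uses GraphX on Spark). It assumes, purely for illustration, that the pattern count attributed to vertices of degree d has the form P(d) = c * d^(-alpha), fits c and alpha by ordinary least squares in log-log space, and sums the fitted curve over the degrees present in the graph. The function names `fit_power_law` and `estimate_total` are hypothetical, and the edge weighting and interpolation steps mentioned in the abstract are omitted.

```python
import math


def fit_power_law(samples):
    """Fit P(d) = c * d**(-alpha) to (degree, pattern_count) samples.

    Ordinary least squares on log(P) = log(c) - alpha * log(d).
    Assumption: every sampled degree and count is strictly positive.
    Returns the pair (c, alpha).
    """
    xs = [math.log(d) for d, _ in samples]
    ys = [math.log(p) for _, p in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # slope equals -alpha
    intercept = mean_y - slope * mean_x    # intercept equals log(c)
    return math.exp(intercept), -slope


def estimate_total(c, alpha, degrees):
    """Sum the fitted curve over every distinct degree in the graph."""
    return sum(c * d ** (-alpha) for d in degrees)
```

In this sketch, a caller would count patterns (exactly or approximately) for a handful of sampled degrees, pass those pairs to `fit_power_law`, and then call `estimate_total` with the full set of distinct degrees, so that most degrees never need to be counted directly.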

Acknowledgement

This research is partially supported by the National Key Research and Development Program of China (ID 2018AAA0103203) and by PCL Peng Cheng Cloud Brain (ID PCL2021A13). We thank Yingchun Ma for valuable comments on early versions, and we thank the reviewers for their insights.

Corresponding author

Correspondence to Ruini Xue.

Copyright information

© 2023 Springer Nature Switzerland AG

About this paper

Cite this paper

Xue, R., Wang, Y., Liu, S., Li, Y., Tian, W., Zheng, W. (2023). SPAC: Scalable Pattern Approximate Counting in Graph Mining. In: Meng, W., Lu, R., Min, G., Vaidya, J. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2022. Lecture Notes in Computer Science, vol 13777. Springer, Cham. https://doi.org/10.1007/978-3-031-22677-9_12

  • DOI: https://doi.org/10.1007/978-3-031-22677-9_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-22676-2

  • Online ISBN: 978-3-031-22677-9

  • eBook Packages: Computer Science (R0)
