SPAC: Scalable Pattern Approximate Counting in Graph Mining

Xue, Ruini; Wang, Yijun; Liu, Shengbo; Li, Yunxiang; Tian, Wenhong; Zheng, Weimin

doi:10.1007/978-3-031-22677-9_12

Ruini Xue (ORCID: orcid.org/0000-0003-1802-5188),
Yijun Wang,
Shengbo Liu,
Yunxiang Li,
Wenhong Tian (ORCID: orcid.org/0000-0002-5551-9796) &
Weimin Zheng

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13777)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

Abstract

Pattern counting is a crucial task in graph pattern mining. Accurate counting becomes unaffordable as datasets grow larger, so approximate counting is increasingly used to provide an estimated answer quickly. However, current approximate counting approaches are still time-consuming and do not scale to extra-large graphs. This paper proposes SPAC, a fast and flexible approximate pattern counting method based on the observation that the distribution of pattern counts over degrees follows a power law, just as the distribution of vertices does, which is a common feature of graph datasets. By leveraging this distribution, SPAC can efficiently choose a small number of degrees as samples, fit the coefficients, and then calculate the pattern frequency directly. To provide flexibility for different use cases, SPAC supports both accurate and approximate counting in the sampling phase. Moreover, edge weighting and interpolation techniques are adopted to emphasize the sample tail and improve fitting accuracy. The SPAC prototype is implemented with GraphX on Spark and evaluated on various well-known graphs. The experimental results show that SPAC is up to 10x faster than accurate counting while keeping the error below 10%. Compared with existing approximate counting systems, SPAC is generally 1.4x–9x faster, and its error can be reduced to 20% of that of current systems.
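The sample-fit-calculate idea in the abstract can be illustrated with a minimal sketch. It is not the authors' GraphX implementation. It assumes, purely for illustration, that the pattern count attributable to vertices of degree d behaves like f(d) ≈ c·d^k, fits c and k by least squares in log-log space from a few sampled degrees, and then sums the fitted values over all degrees. The names fit_power_law and estimate_total and the optional weights argument are hypothetical; the weights are a generic stand-in for emphasizing some samples and do not reproduce the paper's edge weighting or interpolation techniques.

```python
import math


def fit_power_law(samples, weights=None):
    """Fit count ~= c * degree**k by (weighted) least squares in log-log space.

    samples: list of (degree, pattern_count) pairs, all values > 0.
    weights: optional per-sample weights (hypothetical knob, defaults to 1.0).
    Returns the pair (c, k).
    """
    if len(samples) < 2:
        raise ValueError("need at least two sampled degrees")
    xs = [math.log(d) for d, _ in samples]
    ys = [math.log(n) for _, n in samples]
    ws = weights or [1.0] * len(samples)

    total_w = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total_w
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total_w
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    if sxx == 0.0:
        raise ValueError("sampled degrees must not all be equal")
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))

    k = sxy / sxx                       # slope of the log-log line
    c = math.exp(mean_y - k * mean_x)   # intercept, mapped back from log space
    return c, k


def estimate_total(c, k, degrees):
    """Sum the fitted per-degree counts over every degree of interest."""
    return sum(c * d ** k for d in degrees)


# Synthetic samples that follow f(d) = 3 * d**1.5 exactly (assumed data).
samples = [(d, 3.0 * d ** 1.5) for d in (2, 8, 32)]
c, k = fit_power_law(samples)
total = estimate_total(c, k, range(1, 5))
```

Only the sampled degrees need to be counted, by whatever accurate or approximate counter is available; every other degree is filled in from the fitted curve, which is where the speedup in such a scheme would come from.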

Acknowledgement

This research is partially supported by the National Key Research and Development Program of China (ID 2018AAA0103203) and PCL Peng Cheng Cloud Brain (ID PCL2021A13). We thank Yingchun Ma for valuable comments on early versions and the reviewers for their insights.

Author information

Authors and Affiliations

University of Electronic Science and Technology of China, Chengdu, China
Ruini Xue, Yijun Wang, Shengbo Liu, Yunxiang Li & Wenhong Tian
Tsinghua University, Beijing, China
Weimin Zheng
Peng Cheng Lab (PCL), Shenzhen, China
Ruini Xue & Weimin Zheng

Corresponding author

Correspondence to Ruini Xue.

Editor information

Editors and Affiliations

Technical University of Denmark, Kongens Lyngby, Denmark
Weizhi Meng
University of New Brunswick, Fredericton, NB, Canada
Rongxing Lu
University of Exeter, Exeter, UK
Geyong Min
Rutgers University, Newark, NJ, USA
Jaideep Vaidya

About this paper

Cite this paper

Xue, R., Wang, Y., Liu, S., Li, Y., Tian, W., Zheng, W. (2023). SPAC: Scalable Pattern Approximate Counting in Graph Mining. In: Meng, W., Lu, R., Min, G., Vaidya, J. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2022. Lecture Notes in Computer Science, vol 13777. Springer, Cham. https://doi.org/10.1007/978-3-031-22677-9_12

DOI: https://doi.org/10.1007/978-3-031-22677-9_12
Published: 11 January 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-22676-2
Online ISBN: 978-3-031-22677-9
eBook Packages: Computer Science, Computer Science (R0)
