Abstract
Pattern counting is a crucial task in graph pattern mining. Accurate counting becomes unaffordable as datasets grow, so approximate counting is increasingly used to provide an estimated answer quickly. However, current approximate counting approaches are still time-consuming and do not scale to extra-large graphs. This paper proposes SPAC, a fast and flexible approximate pattern counting method. It is based on the observation that the distribution of pattern counts over vertex degrees follows a power law, just as the vertices themselves do, which is a common feature of graph datasets. By leveraging this distribution, SPAC can efficiently choose a small number of degrees as samples, fit the coefficients, and then calculate the pattern frequency directly. To provide flexibility for different use cases, SPAC supports both accurate and approximate counting in the sampling phase. Moreover, edge weighting and interpolation techniques are adopted to emphasize the sample tail and improve fitting accuracy. The prototype of SPAC is implemented with GraphX on Spark and evaluated on various well-known graphs. The experimental results show that SPAC is up to 10x faster than accurate counting while keeping the error below 10%. Compared to existing approximate counting, SPAC is generally 1.4x–9x faster, and its error can be reduced to 20% of that of current systems.
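The sample-fit-calculate idea described in the abstract can be illustrated with a minimal sketch. It is not the SPAC implementation (which the abstract says is built with GraphX on Spark); it assumes, purely for illustration, that the pattern count attributed to degree d has the form c * d^alpha, and it fits c and alpha by weighted least squares in log-log space. The function names, the optional weights argument (a stand-in for emphasizing the sample tail), and the synthetic numbers are all hypothetical.

```python
import math


def fit_power_law(samples, weights=None):
    """Fit count ~= c * degree**alpha by weighted least squares on logs.

    samples: list of (degree, pattern_count) pairs with positive values.
    weights: optional per-sample weights (hypothetical stand-in for
             emphasizing tail samples); defaults to equal weights.
    """
    if weights is None:
        weights = [1.0] * len(samples)
    xs = [math.log(d) for d, _ in samples]
    ys = [math.log(n) for _, n in samples]
    w_sum = sum(weights)
    x_mean = sum(w * x for w, x in zip(weights, xs)) / w_sum
    y_mean = sum(w * y for w, y in zip(weights, ys)) / w_sum
    sxx = sum(w * (x - x_mean) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - x_mean) * (y - y_mean)
              for w, x, y in zip(weights, xs, ys))
    alpha = sxy / sxx                       # slope in log-log space
    c = math.exp(y_mean - alpha * x_mean)   # intercept mapped back from logs
    return c, alpha


def estimate_total(c, alpha, degrees):
    """Sum the fitted per-degree counts over every degree in the graph."""
    return sum(c * d ** alpha for d in degrees)


# Synthetic example: counts measured at only four sampled degrees.
sampled = [(d, 1000.0 * d ** -1.5) for d in (1, 4, 16, 64)]
c, alpha = fit_power_law(sampled)
total = estimate_total(c, alpha, range(1, 101))
print(f"c={c:.2f}, alpha={alpha:.2f}, estimated total={total:.1f}")
```

In this sketch the per-degree counts in `sampled` could come from either an accurate or an approximate counter, mirroring the flexibility the abstract describes for the sampling phase; only the fitting and summation steps are shown.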
Acknowledgement
This research is partially supported by the National Key Research and Development Program of China (ID 2018AAA0103203) and PCL Peng Cheng Cloud Brain (ID PCL2021A13). We thank Yingchun Ma for valuable comments on early versions, and we thank the reviewers for their insights.
Copyright information
© 2023 Springer Nature Switzerland AG
About this paper
Cite this paper
Xue, R., Wang, Y., Liu, S., Li, Y., Tian, W., Zheng, W. (2023). SPAC: Scalable Pattern Approximate Counting in Graph Mining. In: Meng, W., Lu, R., Min, G., Vaidya, J. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2022. Lecture Notes in Computer Science, vol 13777. Springer, Cham. https://doi.org/10.1007/978-3-031-22677-9_12
DOI: https://doi.org/10.1007/978-3-031-22677-9_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-22676-2
Online ISBN: 978-3-031-22677-9
eBook Packages: Computer Science, Computer Science (R0)