DOI: 10.1145/3448016.3457246
research-article

Combining Sampling and Synopses with Worst-Case Optimal Runtime and Quality Guarantees for Graph Pattern Cardinality Estimation

Published: 18 June 2021

Abstract

Graph pattern cardinality estimation is the problem of estimating the number of embeddings of a query graph in a data graph. This fundamental problem arises, for example, during query planning in subgraph matching algorithms. There are two major approaches to solving the problem: sampling and synopses. Synopsis (or summary)-based methods are fast and accurate if the synopses capture the information in the graph well. However, these methods suffer from large errors due to the loss of information during summarization and their inherent assumptions. Sampling-based methods are unbiased but suffer from large estimation variance due to the large sample space. To address these limitations, we propose Alley, a hybrid method that combines sampling and synopses. Alley employs 1) a novel sampling strategy, random walk with intersection, which effectively reduces the sample space, 2) branching to further reduce variance, and 3) a novel mining approach that extracts and indexes tangled patterns, which are inherently difficult to estimate by sampling, as synopses. By using these synopses in the online estimation phase, we can effectively reduce the sample space while still ensuring unbiasedness. We establish that Alley has worst-case optimal runtime and approximation quality guarantees for any given error bound ε and required confidence μ. Beyond these theoretical guarantees, our extensive experiments show that Alley achieves up to orders of magnitude higher accuracy than state-of-the-art methods with similar efficiency.
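
The sampling idea in the abstract can be illustrated with a minimal sketch. It is not Alley's implementation: it omits branching and the tangled-pattern synopses, and the function names, the adjacency-set representation, and the fixed matching order are assumptions made for illustration. Each walk picks one data vertex per query vertex uniformly from the intersection of the neighbor sets of the already matched neighbors, and weights a completed walk by the product of the candidate-set sizes (a Horvitz-Thompson style weight), which keeps the estimate unbiased.

```python
import random
from itertools import permutations


def estimate_embeddings(data_adj, query_adj, order, num_samples, seed=0):
    """Unbiased sampling estimate of the number of embeddings (illustrative sketch).

    data_adj / query_adj map each vertex to the set of its neighbors
    (undirected graphs). `order` lists the query vertices in matching order.
    """
    rng = random.Random(seed)
    all_vertices = set(data_adj)
    total = 0.0
    for _ in range(num_samples):
        mapping = {}   # query vertex -> data vertex chosen on this walk
        weight = 1     # inverse of the probability of this walk
        for u in order:
            matched = [mapping[w] for w in query_adj[u] if w in mapping]
            if matched:
                # Intersect the neighbor sets of all matched query neighbors.
                candidates = set.intersection(*(data_adj[v] for v in matched))
            else:
                candidates = set(all_vertices)
            candidates = candidates - set(mapping.values())  # keep the mapping injective
            if not candidates:
                weight = 0  # the walk failed and contributes zero
                break
            weight *= len(candidates)
            mapping[u] = rng.choice(sorted(candidates))
        total += weight
    return total / num_samples


def count_embeddings_exact(data_adj, query_adj):
    """Brute-force count of embeddings, usable only on tiny graphs."""
    q = sorted(query_adj)
    count = 0
    for image in permutations(sorted(data_adj), len(q)):
        m = dict(zip(q, image))
        if all(m[w] in data_adj[m[u]] for u in q for w in query_adj[u]):
            count += 1
    return count


if __name__ == "__main__":
    # A triangle {0, 1, 2} with a pendant vertex 3 attached to vertex 0.
    data = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
    triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
    print("exact:", count_embeddings_exact(data, triangle))
    print("estimate:", estimate_embeddings(data, triangle, ["a", "b", "c"], 20000))
```

Because the third query vertex is drawn from the intersection of two neighbor sets rather than from a single neighbor set, fewer walks fail at the last step, which is the intuition behind reducing the sample space.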

Supplementary Material

MP4 File (3448016.3457246.mp4)
Graph pattern cardinality estimation is the problem of estimating the number of embeddings |M| of a query graph in a data graph. This fundamental problem arises, for example, during query planning in subgraph matching algorithms. There are two major approaches to solving the problem: sampling and synopses. Synopsis (or summary)-based methods are fast and accurate if the synopses capture the information in the graph well. However, these methods suffer from large errors due to the loss of information during summarization and their inherent assumptions. Sampling-based methods are unbiased but suffer from large estimation variance due to the large sample space. To address these limitations, we propose Alley, a hybrid method that combines sampling and synopses. Alley employs 1) a novel sampling strategy, random walk with intersection, which effectively reduces the sample space, 2) branching to further reduce variance, and 3) a novel mining approach that extracts and indexes tangled patterns, which are inherently difficult to estimate by sampling, as synopses. By using these synopses in the online estimation phase, we can effectively reduce the sample space while still ensuring unbiasedness. We establish that Alley has worst-case optimal runtime and approximation quality guarantees. That is, for any given error bound ε and required confidence μ, Alley guarantees that the estimation error is bounded by ε with confidence μ; if Z denotes the random variable for the estimate, Pr(|Z − |M|| < ε·|M|) > μ. Beyond these theoretical guarantees, our extensive experiments show that Alley achieves up to orders of magnitude higher accuracy than state-of-the-art methods with similar efficiency.
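
As a rough illustration of how an error bound ε and a confidence μ relate to the number of samples, the following is the standard Chebyshev argument for the mean of n independent unbiased estimates. It is a generic textbook bound, not the analysis from the paper. The symbols n, Z_i, and the sample mean \bar{Z}_n are introduced here for illustration, and the bound assumes |M| > 0, μ < 1, and finite variance.

```latex
% Requires amsmath. Z_1, ..., Z_n are i.i.d. with E[Z_i] = |M|.
\[
  \bar{Z}_n = \frac{1}{n}\sum_{i=1}^{n} Z_i,
  \qquad
  \Pr\bigl(\lvert \bar{Z}_n - |M| \rvert \ge \epsilon\,|M|\bigr)
  \le \frac{\mathrm{Var}(Z_1)}{n\,\epsilon^{2}\,|M|^{2}} .
\]
\[
  \text{Hence}\quad
  \Pr\bigl(\lvert \bar{Z}_n - |M| \rvert < \epsilon\,|M|\bigr) > \mu
  \quad\text{whenever}\quad
  n > \frac{\mathrm{Var}(Z_1)}{(1-\mu)\,\epsilon^{2}\,|M|^{2}} .
\]
```

The bound makes explicit why variance reduction matters: the number of samples needed grows linearly with the variance of a single estimate.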

      Published In

      SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
      June 2021
      2969 pages
      ISBN:9781450383431
      DOI:10.1145/3448016

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. cardinality estimation
      2. graph data
      3. sampling
      4. worst-case optimal

      Qualifiers

      • Research-article

      Funding Sources

      • National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT)

      Conference

      SIGMOD/PODS '21

      Acceptance Rates

      Overall Acceptance Rate 785 of 4,003 submissions, 20%

      Article Metrics

      • Downloads (Last 12 months): 105
      • Downloads (Last 6 weeks): 9
      Reflects downloads up to 03 Mar 2025

      Cited By

      • (2024) Color: A Framework for Applying Graph Coloring to Subgraph Cardinality Estimation. Proceedings of the VLDB Endowment 18:2 (130-143). DOI: 10.14778/3705829.3705834. Online publication date: 1-Oct-2024
      • (2024) Cardinality Estimation of Subgraph Matching: A Filtering-Sampling Approach. Proceedings of the VLDB Endowment 17:7 (1697-1709). DOI: 10.14778/3654621.3654635. Online publication date: 30-May-2024
      • (2024) gSWORD: GPU-accelerated Sampling for Subgraph Counting. Proceedings of the ACM on Management of Data 2:1 (1-26). DOI: 10.1145/3639288. Online publication date: 26-Mar-2024
      • (2024) Generalized Measure-Biased Sampling and Priority Sampling. IEEE Transactions on Knowledge and Data Engineering 36:11 (6251-6265). DOI: 10.1109/TKDE.2023.3340673. Online publication date: Nov-2024
      • (2024) LearnSC: An Efficient and Unified Learning-Based Framework for Subgraph Counting Problem. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (2625-2638). DOI: 10.1109/ICDE60146.2024.00206. Online publication date: 13-May-2024
      • (2023) Accurate Summary-based Cardinality Estimation Through the Lens of Cardinality Estimation Graphs. ACM SIGMOD Record 52:1 (94-102). DOI: 10.1145/3604437.3604458. Online publication date: 8-Jun-2023
      • (2023) A General Cardinality Estimation Framework for Subgraph Matching in Property Graphs. IEEE Transactions on Knowledge and Data Engineering 35:6 (5485-5505). DOI: 10.1109/TKDE.2022.3161328. Online publication date: 1-Jun-2023
