research-article

Combining Sampling and Synopses with Worst-Case Optimal Runtime and Quality Guarantees for Graph Pattern Cardinality Estimation

Authors:

Kyoungmin Kim,

Hyeonji Kim,

George Fletcher,

Wook-Shin HanAuthors Info & Claims

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data

Pages 964 - 976

https://doi.org/10.1145/3448016.3457246

Published: 18 June 2021 Publication History

Get Access

Abstract

Graph pattern cardinality estimation is the problem of estimating the number of embeddings of a query graph in a data graph. This fundamental problem arises, for example, during query planning in subgraph matching algorithms. There are two major approaches to solving the problem: sampling and synopsis. Synopsis (or summary)-based methods are fast and accurate if synopses capture information of graphs well. However, these methods suffer from large errors due to loss of information during summarization and inherent assumptions. Sampling-based methods are unbiased but suffer from large estimation variance due to large sample space. To address these limitations, we propose Alley, a hybrid method that combines both sampling and synopses. Alley employs 1) a novel sampling strategy, random walk with intersection, which effectively reduces the sample space, 2) branching to further reduce variance, and 3) a novel mining approach that extracts and indexes tangled patterns as synopses which are inherently difficult to estimate by sampling. By using them in the online estimation phase, we can effectively reduce the sample space while still ensuring unbiasedness. We establish that Alley has worst-case optimal runtime and approximation quality guarantees for any given error bound ε and required confidence μ. In addition to the theoretical aspect of Alley, our extensive experiments show that Alley outperforms the state-of-the-art methods by up to orders of magnitude higher accuracy with similar efficiency.

Supplementary Material

MP4 File (3448016.3457246.mp4)

Graph pattern cardinality estimation is the problem of estimating the number of embedding |M| of a query graph in a data graph. This fundamental problem arises, for example, during query planning in subgraph matching algorithms. There are two major approaches to solving the problem: sampling and synopsis. Synopsis (or summary)-based methods are fast and accurate if synopses capture information of graphs well. However, these methods suffer from large errors due to loss of information during summarization and inherent assumptions. Sampling-based methods are unbiased but suffer from large estimation variance due to large sample space. To address these limitations, we propose Alley, a hybrid method that combines both sampling and synopses. Alley employs 1) a novel sampling strategy, random walk with intersection, which effectively reduces the sample space, 2) branching to further reduce variance, and 3) a novel mining approach that extracts and indexes tangled patterns as synopses which are inherently difficult to estimate by sampling. By using them in the online estimation phase, we can effectively reduce the sample space while still ensuringun biasedness. We establish that Alley has worst-case optimal runtime and approximation quality guarantees. That is, for any given error bound and required confidence, Alley guarantees that the estimation error is bounded by with confidence ; if Z denotes the random variable for the estimate, Pr(|Z|M|| < |M|) > . In addition to the theoretical aspect of Alley, our extensive experiments show that Alley outperforms the state-of-the-art methods by up to orders of magnitude higher accuracy with similar efficiency.

Download
114.30 MB

References

[1]

Ehab Abdelhamid, Ibrahim Abdelaziz, Panos Kalnis, Zuhair Khayyat, and Fuad Jamour. 2016. Scalemine: Scalable parallel frequent subgraph mining in a single large graph. In SC'16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 716--727.

Abstract

Supplementary Material

References

Cited By

Index Terms

Recommendations

Synopses for query optimization: A space-complexity perspective

Accurate Sampling-Based Cardinality Estimation for Complex Graph Queries

ASM: Harmonizing Autoregressive Model, Sampling, and Multi-dimensional Statistics Merging for Cardinality Estimation

Comments

Information

Published In

Sponsors

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Funding Sources

Conference

Acceptance Rates

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

Login options

Full Access

View options

PDF

eReader

Share

Share this Publication link

Share on social media

Affiliations