Abstract
For exploratory data analysis, it is often desirable to know what answers you are likely to get before actually obtaining them. One way to achieve this is to design systems that offer estimates of a data operation's result, say op(data), early in the process, based on partial data processing. Those estimates are continuously refined as more data is processed and finally converge to the exact answer. Unfortunately, existing techniques, known as Online Aggregation (OLA), are limited to a single operation; that is, we cannot obtain estimates for op(op(data)) or op(...(op(data))). If such Deep OLA becomes possible, data analysts will be able to explore data more interactively using complex cascaded operations.
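To make the single-operation case concrete, the following is a minimal sketch of online aggregation for one op (a mean), assuming the data arrives in fixed-size chunks. The function name `online_mean` and the synthetic data are illustrative assumptions, not part of any system described here.

```python
import random


def online_mean(values, chunk_size):
    """Yield (rows_seen, running_mean) after each chunk.

    Each yielded mean is an estimate based on the rows processed so far;
    the last one is computed over all rows and is therefore exact.
    """
    total = 0.0
    seen = 0
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        total += sum(chunk)
        seen += len(chunk)
        yield seen, total / seen


# Synthetic data (an assumption for illustration only).
random.seed(0)
data = [random.gauss(100.0, 15.0) for _ in range(10_000)]

estimates = list(online_mean(data, chunk_size=1_000))
for rows_seen, estimate in estimates:
    print(f"{rows_seen:>6} rows processed: mean is about {estimate:.3f}")
```

An analyst can stop reading estimates as soon as the value looks stable enough; the difficulty raised above is that this pattern, as written, covers only one op.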
In this work, we take a step toward Deep OLA with evolving data frames (edf), a novel data model that offers OLA for nested ops, op(...(op(data))), by representing evolving structured data (with converging estimates) that is closed under set operations. That is, op(edf) produces yet another edf; thus, we can freely apply successive operations to an edf and obtain an OLA output for each op. We evaluate its viability with Wake, an edf-based OLA system, by comparing it against state-of-the-art OLA and non-OLA systems. In our experiments on the TPC-H dataset, Wake produces its first estimates 4.93× faster (median), with a 1.3× median slowdown for exact answers, compared to conventional systems. Besides its generality, Wake is also 1.92× faster (median) than existing OLA systems in producing estimates with under 1% relative error.
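The closure property can be sketched in a few lines, under strong simplifying assumptions: here an "evolving" value is modeled as a Python generator of successive snapshots, and every op maps one such stream to another, so ops can be nested freely. This is not Wake's API or its estimation method; the names `scan`, `group_sum`, and `top_key` are hypothetical, the scaling by 1/fraction is a naive estimator chosen for brevity, and each snapshot is recomputed from scratch rather than incrementally.

```python
from collections import defaultdict


def scan(rows, chunk_size):
    """Source: yield (fraction_processed, rows_seen_so_far) snapshots."""
    for end in range(chunk_size, len(rows) + chunk_size, chunk_size):
        seen = rows[:end]
        yield len(seen) / len(rows), seen


def group_sum(stream):
    """First op: per-key sum, scaled by 1/fraction as a naive estimate."""
    for fraction, rows in stream:
        sums = defaultdict(float)
        for key, value in rows:
            sums[key] += value
        yield fraction, {k: v / fraction for k, v in sums.items()}


def top_key(stream):
    """Second op, applied to the first op's output: key with the largest sum."""
    for fraction, sums in stream:
        yield fraction, max(sums, key=sums.get)


# Toy input (an assumption for illustration only).
rows = [("a", 1.0), ("b", 5.0), ("a", 1.0), ("b", 1.0), ("a", 4.0), ("a", 3.0)]

# op(op(data)): every stage consumes and produces a stream of estimates.
results = list(top_key(group_sum(scan(rows, chunk_size=2))))
for fraction, key in results:
    print(f"{fraction:.0%} processed: top key estimate = {key}")
```

On this toy input the nested estimate is "b" after the first two chunks and changes to "a" once all rows are processed, which illustrates both the early availability of an answer for op(op(data)) and its convergence to the exact result.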
Index Terms
- A Step Toward Deep Online Aggregation