Abstract
Interactive applications require processing tens to hundreds of concurrent analytical queries within tight time constraints. In such setups, where high concurrency causes contention, work-sharing databases are critical for improving scalability and for bounding the increase in response time. However, as such databases share data access using full scans and expensive shared filters, they suffer from a data-access bottleneck that jeopardizes interactivity.
We present SH2O: a novel data-access operator that addresses the data-access bottleneck of work-sharing databases. SH2O is based on the idea that an access pattern based on judiciously selected multidimensional ranges can replace a set of shared filters. To exploit the idea in an efficient and scalable manner, SH2O uses a three-tier approach: i) it uses spatial indices to efficiently access the ranges without overfetching, ii) it uses an optimizer to choose which filters to replace so as to maximize the cost-benefit of index accesses, and iii) it exploits partitioning schemes and independently accesses each data partition to reduce the number of filters in the access pattern. Furthermore, we propose a tuning strategy that chooses a partitioning and indexing scheme that minimizes SH2O's cost for a target workload. Our evaluation shows a speedup of 1.8× to 22.2× for batches of hundreds of data-access-bound queries.
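The core idea, replacing shared filters over a full scan with index accesses to multidimensional ranges, can be illustrated with a minimal sketch. This is not SH2O's implementation: the uniform grid below merely stands in for a spatial index, the optimizer and partitioning tiers are omitted, and all names (`shared_scan`, `GridIndex`, `range_access`) are hypothetical. Rows are assumed to be pairs of non-negative integers and each query a rectangle with inclusive bounds.

```python
from collections import defaultdict


def shared_scan(rows, queries):
    """Baseline: one full scan; every row is tested against every query's filter."""
    out = defaultdict(list)
    for rid, (x, y) in enumerate(rows):
        for qid, (x_lo, x_hi, y_lo, y_hi) in enumerate(queries):
            if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
                out[qid].append(rid)
    return dict(out)


class GridIndex:
    """Toy uniform grid over (x, y); a stand-in for a real spatial index."""

    def __init__(self, rows, cell):
        self.rows = rows
        self.cell = cell
        self.cells = defaultdict(list)
        for rid, (x, y) in enumerate(rows):
            self.cells[(x // cell, y // cell)].append(rid)

    def range_query(self, x_lo, x_hi, y_lo, y_hi):
        """Return sorted ids of rows inside the rectangle (bounds inclusive)."""
        hits = []
        # Visit only the grid cells that overlap the rectangle.
        for cx in range(x_lo // self.cell, x_hi // self.cell + 1):
            for cy in range(y_lo // self.cell, y_hi // self.cell + 1):
                for rid in self.cells.get((cx, cy), ()):
                    x, y = self.rows[rid]
                    # Boundary cells may hold rows outside the rectangle.
                    if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
                        hits.append(rid)
        return sorted(hits)


def range_access(index, queries):
    """Answer each query through the index instead of filtering a full scan."""
    out = {}
    for qid, q in enumerate(queries):
        hits = index.range_query(*q)
        if hits:
            out[qid] = hits
    return out


rows = [(x, y) for x in range(10) for y in range(10)]
queries = [(0, 2, 0, 2), (5, 9, 5, 9), (20, 30, 20, 30)]
index = GridIndex(rows, cell=4)
by_scan = shared_scan(rows, queries)
by_index = range_access(index, queries)
```

Both access paths return the same row ids per query; the index path touches only the cells that overlap a query's rectangle, whereas the scan evaluates every filter on every row. Whether the index path is actually cheaper depends on selectivity and the number of queries, which is the trade-off the abstract assigns to the optimizer.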