Abstract
Parallel DBMSs and Spark SQL now compete to query big data. Parallel DBMSs build on extensive experience, embodied in powerful data partitioning and data allocation algorithms, but they struggle to handle dynamic changes in the query workload. Spark SQL, on the other hand, has become a solution for processing query workloads on big data outside the DBMS realm. Unfortunately, Spark SQL incurs significant random disk I/O cost, because no correlation is detected between Spark jobs and the data blocks read from disk. As a consequence, Spark fails to deliver high performance in a dynamic analytic environment. To address this limitation, we propose an adaptive, query-aware framework for partitioning big data tables for query processing, based on a genetic optimization problem formulation. Our approach intensively rewrites queries by exploiting the different dimension hierarchies that may exist among dimension attributes, skipping irrelevant data to improve I/O performance. We present an experimental validation on a Spark SQL parallel cluster, showing promising results.
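As a rough illustration of the idea of rewriting a query along a dimension hierarchy so that irrelevant partitions can be skipped, the following minimal Python sketch may help. It is not the framework proposed in the paper: the month-to-quarter hierarchy, the per-month partition layout, and every name in it (`QUARTER_TO_MONTHS`, `PARTITIONS`, `rewrite_quarter_predicate`, `total_sales_for_quarter`) are assumptions made up for this example, and it runs in plain Python rather than on Spark SQL.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical dimension hierarchy: each quarter rolls up a set of months.
QUARTER_TO_MONTHS: Dict[str, Set[int]] = {
    "Q1": {1, 2, 3},
    "Q2": {4, 5, 6},
    "Q3": {7, 8, 9},
    "Q4": {10, 11, 12},
}

# Hypothetical fact table stored as one partition per month (toy data).
PARTITIONS: Dict[int, List[dict]] = {
    m: [{"month": m, "sales": 10 * m}] for m in range(1, 13)
}


def rewrite_quarter_predicate(quarter: str) -> Set[int]:
    """Rewrite `quarter = X` into the equivalent `month IN (...)` set."""
    return set(QUARTER_TO_MONTHS[quarter])


def total_sales_for_quarter(quarter: str) -> Tuple[int, int]:
    """Return (total sales, number of partitions scanned) for one quarter."""
    months = rewrite_quarter_predicate(quarter)
    # Only partitions whose key matches the rewritten predicate are read;
    # all other partitions are skipped without touching their rows.
    scanned = [m for m in PARTITIONS if m in months]
    total = sum(row["sales"] for m in scanned for row in PARTITIONS[m])
    return total, len(scanned)
```

In this toy setting, a predicate on the coarse attribute (quarter) is translated into a predicate on the partitioning attribute (month), so a query over one quarter reads 3 of the 12 partitions instead of all of them.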
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Benkrid, S., Bellatreche, L., Mestoui, Y., Ordonez, C. (2021). Towards an Adaptive Multidimensional Partitioning for Accelerating Spark SQL. In: Golfarelli, M., Wrembel, R., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2021. Lecture Notes in Computer Science, vol 12925. Springer, Cham. https://doi.org/10.1007/978-3-030-86534-4_3
DOI: https://doi.org/10.1007/978-3-030-86534-4_3
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86533-7
Online ISBN: 978-3-030-86534-4
eBook Packages: Computer Science, Computer Science (R0)