Abstract
The interconnection between query processing and data partitioning is pivotal for the acceleration of massive data processing during query execution, primarily by minimizing the number of scanned block files. Existing partitioning techniques predominantly focus on query accesses on numeric columns for constructing partitions, often overlooking non-numeric columns and thus limiting optimization potential. Additionally, these techniques, despite creating fine-grained partitions from representative queries to enhance system performance, experience from notable performance declines due to unpredictable fluctuations in future queries. To tackle these issues, we introduce LRP, a learned robust partitioning system for dynamic query processing. LRP first proposes a method for data and query encoding that captures comprehensive column access patterns from historical queries. It then employs Multi-Layer Perceptron and Long Short-Term Memory networks to predict shifts in the distribution of historical queries. To create high-quality, robust partitions based on these predictions, LRP adopts a greedy beam search algorithm for optimal partition division and implements a data redundancy mechanism to share frequently accessed data across partitions. Experimental evaluations reveal that LRP yields partitions with more stable performance under incoming queries and significantly surpasses state-of-the-art partitioning methods.
Similar content being viewed by others
References
Taylor R W, Sacca D, Wiederhold G. Database partitioning in a cluster of processors. ACM Transactions on Database Systems (TODS), 1985, 10(1): 29–56
Copeland G, Alexander W, Boughter E, Keller T. Data placement in bubba. In: Proceedings of 1988 ACM SIGMOD International Conference on Management of Data. 1988, 99–108
Stöhr T, Märtens H, Rahm E. Multi-dimensional database allocation for parallel data warehouses. In: Proceedings of the 26th International Conference on Very Large Data Bases. 2000, 273–284
Bentley J L. Multidimensional binary search trees used for associative searching. Communications of the ACM, 1975, 18(9): 509–517
Zhan C, Su M, Wei C, Peng X, Lin L, Wang S, Chen Z, Li F, Pan Y, Zheng F, Chai C. AnalyticDB: real-time OLAP database system at alibaba cloud. Proceedings of the VLDB Endowment, 2019, 12(12): 2059–2070
Papadomanolakis S, Ailamaki A. AutoPart: automating schema design for large scientific databases using data partitioning. In: Proceedings of the 16th International Conference on Scientific and Statistical Database Management. 2004, 383–392
Sun L, Franklin M J, Krishnan S, Xin R S. Fine-grained partitioning for aggressive data skipping. In: Proceedings of 2014 ACM SIGMOD International Conference on Management of Data. 2014, 1115–1126
Ang Z, Chandramouli B, Wang C, Gehrke J, Li Y, Minhas U F, Larson PÅ, Kossmann D, Acharya R. Qd-tree: learning data layouts for big data analytics. In: Proceedings of 2020 ACM SIGMOD International Conference on Management of Data. 2020, 193–208
Li Z, Yiu M L, Chan T N. PAW: data partitioning meets workload variance. In: Proceedings of the 38th IEEE International Conference on Data Engineering. 2022, 123–135
Sun L, Franklin M J, Wang J, Wu E. Skipping-oriented partitioning for columnar layouts. Proceedings of the VLDB Endowment, 2016, 10(4): 421–432
Li C, Markl V, Aly A M, Mahmood A R, Hassan M S, Aref W G, Ouzzani M, Elmeleegy H, Qadah T. AQWA: adaptive query workload aware partitioning of big spatial data. Proceedings of the VLDB Endowment, 2015, 8(13): 2062–2073
Aly A M, Elmeleegy H, Qi Y, Aref W. Kangaroo: workload-aware processing of range data and range queries in hadoop. In: Proceedings of the 9th ACM International Conference on Web Search and Data Mining. 2016, 397–406
Lu Y, Shanbhag A, Jindal A, Madden S. AdaptDB: adaptive partitioning for distributed joins. Proceedings of the VLDB Endowment, 2017, 10(5): 589–600
Ding J, Minhas U F, Chandramouli B, Wang C, Li Y, Li Y, Kossmann D, Gehrke J, Kraska T. Instance-optimized data layouts for cloud analytics workloads. In: Proceedings of 2021 International Conference on Management of Data. 2021, 418–431
TPC-H benchmark. See tpc.org/tpch/ website, 1999.
Rosenblatt F. Principles of neurodynamics: perceptrons and the theory of brain mechanisms. The American Journal of Psychology, 1963, 76(4): 705–707
Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9(8): 1735–1780
Shvachko K, Kuang H, Radia S, Chansler R. The hadoop distributed file system. In: Proceedings of the 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST). 2010, 1–10
Shanbhag A, Jindal A, Madden S, Quiane J, Elmore A J. A robust partitioning scheme for ad-hoc query workloads. In: Proceedings of 2017 Symposium on Cloud Computing. 2017, 229–241
Huang D, Liu Q, Cui Q, Fang Z, Ma X, Xu F, Shen L, Tang L, Zhou Y, Huang M, Wei W, Liu C, Zhang J, Li J, Wu X, Song L, Sun R, Yu S, Zhao L, Cameron N, Pei L, Tang X. TIDB: a raft-based HTAP database. Proceedings of the VLDB Endowment, 2020, 13(12): 3072–3084
ClickHouse: an open-source columnar database management system. See clickhouse.com/docs/en/observability/managing-data website, 2016
Dageville B, Cruanes T, Zukowski M, Antonov V, Avanes A, Bock J, Claybaugh J, Engovatov D, Hentschel M, Huang J S, Lee A W, Motivala A, Munir A Q, Pelley S, Povinec P, Rahn G, Triantafyllis S, Unterbrunner P. The snowflake elastic data warehouse. In: Proceedings of 2016 International Conference on Management of Data. 2016, 215–226
Moerkotte G. Small materialized aggregates: a light weight index structure for data warehousing. In: Proceedings of the 24th International Conference on Very Large Data Bases. 1998, 476–487
Graefe G. Fast loads and fast queries. In: Proceedings of the 11th International Conference on Data Warehousing and Knowledge Discovery. 2009, 111–124
Kang D, Jiang R, Blanas S. Jigsaw: a data storage and query processing engine for irregular table partitioning. In: Proceedings of 2021 International Conference on Management of Data. 2021, 898–911
Han A, Yan X, Tao S, Anerousis N. Workload characterization and pre diction in the cloud: a multiple time series approach. In: Proceedings of 2012 IEEE Network Operations and Management Symposium. 2012, 1287–1294
Pavlo A, Angulo G, Arulraj J, Lin H, Lin J, Ma L, Menon P, Mowry T C, Perron M, Quah I, Santurkar S, Tomasic A, Toor S, Van Aken D, Wang Z, Wu Y, Xian R, Zhang T. Self-driving database management systems. In: Proceedings of the 8th Biennial Conference on Innovative Data Systems Research. 2017, 1
Ma L, Van Aken D, Hefny A, Mezerhane G, Pavlo A, Gordon G J. Query-based workload forecasting for self-driving database management systems. In: Proceedings of 2018 International Conference on Management of Data. 2018, 631–645
Hilprecht B, Binnig C, Röhm U. Learning a partitioning advisor for cloud databases. In: Proceedings of 2020 ACM SIGMOD International Conference on Management of Data. 2020, 143–157
Zhou X, Li G, Feng J, Liu L, Guo W. Grep: a graph learning based database partitioning system. Proceedings of the ACM on Management of Data, 2023, 1(1): 94
Jindal A, Dittrich J. Relax and let the database do the partitioning online. In: Proceedings of the 5th International Workshop on Business Intelligence for the Real-Time Enterprise. 2011, 65–80
Wang J, Chai C, Liu J, Li G. Face: a normalizing flow based cardinality estimator. Proceedings of the VLDB Endowment, 2021, 15(1): 72–84
Hoerl A E, Kennard R W. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 1970, 12(1): 55–67
Bertino E, Atzeni P, Tan K L, Chen Y, Tay Y C, Melnik S, Gubarev A, Long J J, Romer G, Shivakumar S, Tolton M, Vassilakis T. Dremel: interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 2010, 3(1–2): 330–339
Ray: an open source framework to build and scale your ML and Python applications. See docs.ray.io/en/latest/ website, 2017
TPC-DS benchmark. See www.tpc.org/tpcds/ website, 2005
JOB benchmark. See developer.imdb.com/non-commercial-datasets/ website, 2016
ClickBench benchmark. See github.com/ClickHouse/ClickBench website, 2019
Acknowledgements
This work was supported by the National Key Research and Development Program of China (Grant No. 2023YFB4503600) and the National Natural Science Foundation of China (Grant Nos. U23A20299, 62072460, 62172424, 62276270, and 62322214).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Competing interests The authors declare that they have no competing interests or financial conflicts to disclose.
Additional information
Pengju Liu is currently pursuing his PhD degree at the School of Information, Renmin University of China, China. He received his B.S. degree in information management and information system from Dalian Maritime University, China, in 2020. His research interests include adaptable database partitioning, and load forecasting.
Pan Cai is a PhD candidate at School of Information, Renmin University of China, China. Her research interests include query optimization, learned index, and machine learning. She is particularly interested in the structural design of learned indexes on multidimensional data.
Kai Zhong is a PhD student at School of Information, Renmin University of China, China, advised by Professor Cuiping Li. He received his BS degree in Information and Computing Science at School of Information, Huazhong Agricultural University, China in June 2022. His research lie in the field of AI4DB and Machine Learning.
Cuiping Li is currently a professor at the Renmin University of China, China. She received her PhD degree from Chinese Academy of Sciences, China in 2003. Prior to that, she earned her BS and MS degrees from Xi’an Jiaotong University, China in 1994 and 1997, respectively. She is a distinguished member of CCF. Her main research interests are machine learning for database and distributed query optimization.
Hong Chen is currently a professor at the Renmin University of China, China. She received her PhD degree from Chinese Academy of Sciences, China in 2000. Before that, she received her BS and MS degrees from Renmin University of China, China in 1986 and 1989, respectively. She received Second Prize of the National Award for Science and Technology Progress in 2018. She is a committee member of CCF and CIC. Her research interests include database technology and high-performance computing.
Electronic supplementary material
Rights and permissions
About this article
Cite this article
Liu, P., Cai, P., Zhong, K. et al. LRP: learned robust data partitioning for efficient processing of large dynamic queries. Front. Comput. Sci. 19, 199607 (2025). https://doi.org/10.1007/s11704-024-40509-4
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s11704-024-40509-4