LRP: learned robust data partitioning for efficient processing of large dynamic queries

Liu, Pengju; Cai, Pan; Zhong, Kai; Li, Cuiping; Chen, Hong

doi:10.1007/s11704-024-40509-4

Research Article
Published: 22 January 2025

Volume 19, article number 199607 (2025)

Frontiers of Computer Science

Pengju Liu^1,2, Pan Cai^1,2, Kai Zhong^1,2, Cuiping Li^1,2 & Hong Chen^1,2

Abstract

The interplay between query processing and data partitioning is pivotal for accelerating massive data processing, primarily because a good partitioning minimizes the number of block files scanned during query execution. Existing partitioning techniques predominantly consider query accesses on numeric columns when constructing partitions, often overlooking non-numeric columns and thus limiting optimization potential. Additionally, although these techniques create fine-grained partitions from representative queries to enhance system performance, they suffer notable performance declines when future queries fluctuate unpredictably. To tackle these issues, we introduce LRP, a learned robust partitioning system for dynamic query processing. LRP first proposes a method for data and query encoding that captures comprehensive column access patterns from historical queries. It then employs Multi-Layer Perceptron and Long Short-Term Memory networks to predict shifts in the distribution of historical queries. To create high-quality, robust partitions based on these predictions, LRP adopts a greedy beam search algorithm for optimal partition division and implements a data redundancy mechanism to share frequently accessed data across partitions. Experimental evaluations reveal that LRP yields partitions with more stable performance under incoming queries and significantly surpasses state-of-the-art partitioning methods.
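To make the idea of partition division by beam search concrete, the following is a minimal illustrative sketch, not LRP's actual algorithm. It assumes integer-valued columns, range-predicate queries given as `{column_index: (low, high)}`, and a cost equal to the number of rows in every partition whose min/max bounding box overlaps a query. All function names and parameters (`scan_cost`, `beam_search`, `beam_width`, `max_splits`, `min_rows`) are hypothetical, and the sketch omits the encoding, forecasting, and redundancy components described above.

```python
def bounds(rows):
    """Per-column (min, max) metadata for one non-empty partition."""
    return [(min(col), max(col)) for col in zip(*rows)]


def overlaps(box, query):
    """True if the partition's bounding box intersects every range in the query."""
    return all(box[c][0] <= hi and lo <= box[c][1] for c, (lo, hi) in query.items())


def scan_cost(layout, queries):
    """Rows read in total: an overlapping partition is scanned in full."""
    return sum(len(p) for q in queries for p in layout if overlaps(bounds(p), q))


def candidate_cuts(queries):
    """Cut points (column, value) derived from query boundaries (integer columns assumed)."""
    cuts = set()
    for q in queries:
        for c, (lo, hi) in q.items():
            cuts.add((c, lo))       # rows with value < lo fall left of the cut
            cuts.add((c, hi + 1))   # rows with value > hi fall right of the cut
    return sorted(cuts)


def expand(layout, cuts, min_rows):
    """Yield every layout reachable by splitting one partition at one cut."""
    for i, part in enumerate(layout):
        for c, v in cuts:
            left = [r for r in part if r[c] < v]
            right = [r for r in part if r[c] >= v]
            if len(left) >= min_rows and len(right) >= min_rows:
                yield layout[:i] + [left, right] + layout[i + 1:]


def beam_search(rows, queries, beam_width=2, max_splits=4, min_rows=1):
    """Keep the beam_width cheapest layouts after each round of splits."""
    cuts = candidate_cuts(queries)
    beam = [[rows]]
    best = beam[0]
    for _ in range(max_splits):
        children = [child for layout in beam for child in expand(layout, cuts, min_rows)]
        if not children:
            break  # no partition can be split further under min_rows
        children.sort(key=lambda layout: scan_cost(layout, queries))
        beam = children[:beam_width]
        if scan_cost(beam[0], queries) < scan_cost(best, queries):
            best = beam[0]
    return best


# Toy workload: a 4x4 grid of rows and two single-column range queries.
rows = [(x, y) for x in range(4) for y in range(4)]
queries = [{0: (0, 1)}, {1: (2, 3)}]
best = beam_search(rows, queries)
print(scan_cost([rows], queries), "->", scan_cost(best, queries))
```

On this toy workload the unpartitioned table is read in full by both queries, whereas the searched layout lets each query skip the partitions whose bounding boxes fall outside its range. A wider beam explores more candidate layouts per round at a proportionally higher search cost.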

Acknowledgements

This work was supported by the National Key Research and Development Program of China (Grant No. 2023YFB4503600) and the National Natural Science Foundation of China (Grant Nos. U23A20299, 62072460, 62172424, 62276270, and 62322214).

Author information

Authors and Affiliations

Engineering Research Center of Database and Business Intelligence, Ministry of Education, Beijing, 100872, China
Pengju Liu, Pan Cai, Kai Zhong, Cuiping Li & Hong Chen
School of Information, Renmin University of China, Beijing, 100872, China
Pengju Liu, Pan Cai, Kai Zhong, Cuiping Li & Hong Chen

Corresponding author

Correspondence to Cuiping Li.

Ethics declarations

Competing interests: The authors declare that they have no competing interests or financial conflicts to disclose.

Additional information

Pengju Liu is currently pursuing his PhD degree at the School of Information, Renmin University of China, China. He received his BS degree in information management and information systems from Dalian Maritime University, China, in 2020. His research interests include adaptable database partitioning and load forecasting.

Pan Cai is a PhD candidate at the School of Information, Renmin University of China, China. Her research interests include query optimization, learned indexes, and machine learning. She is particularly interested in the structural design of learned indexes on multidimensional data.

Kai Zhong is a PhD student at the School of Information, Renmin University of China, China, advised by Professor Cuiping Li. He received his BS degree in information and computing science from the School of Information, Huazhong Agricultural University, China, in June 2022. His research lies in the fields of AI4DB and machine learning.

Cuiping Li is currently a professor at the Renmin University of China, China. She received her PhD degree from the Chinese Academy of Sciences, China, in 2003. Prior to that, she earned her BS and MS degrees from Xi’an Jiaotong University, China, in 1994 and 1997, respectively. She is a distinguished member of CCF. Her main research interests are machine learning for databases and distributed query optimization.

Hong Chen is currently a professor at the Renmin University of China, China. She received her PhD degree from the Chinese Academy of Sciences, China, in 2000. Before that, she received her BS and MS degrees from Renmin University of China, China, in 1986 and 1989, respectively. She received the Second Prize of the National Award for Science and Technology Progress in 2018. She is a committee member of CCF and CIC. Her research interests include database technology and high-performance computing.
