Skip to main content

Advertisement

Log in

LRP: learned robust data partitioning for efficient processing of large dynamic queries

  • Research Article
  • Published:
Frontiers of Computer Science Aims and scope Submit manuscript

Abstract

The interconnection between query processing and data partitioning is pivotal for the acceleration of massive data processing during query execution, primarily by minimizing the number of scanned block files. Existing partitioning techniques predominantly focus on query accesses on numeric columns for constructing partitions, often overlooking non-numeric columns and thus limiting optimization potential. Additionally, these techniques, despite creating fine-grained partitions from representative queries to enhance system performance, experience from notable performance declines due to unpredictable fluctuations in future queries. To tackle these issues, we introduce LRP, a learned robust partitioning system for dynamic query processing. LRP first proposes a method for data and query encoding that captures comprehensive column access patterns from historical queries. It then employs Multi-Layer Perceptron and Long Short-Term Memory networks to predict shifts in the distribution of historical queries. To create high-quality, robust partitions based on these predictions, LRP adopts a greedy beam search algorithm for optimal partition division and implements a data redundancy mechanism to share frequently accessed data across partitions. Experimental evaluations reveal that LRP yields partitions with more stable performance under incoming queries and significantly surpasses state-of-the-art partitioning methods.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Taylor R W, Sacca D, Wiederhold G. Database partitioning in a cluster of processors. ACM Transactions on Database Systems (TODS), 1985, 10(1): 29–56

    Article  MATH  Google Scholar 

  2. Copeland G, Alexander W, Boughter E, Keller T. Data placement in bubba. In: Proceedings of 1988 ACM SIGMOD International Conference on Management of Data. 1988, 99–108

    Chapter  MATH  Google Scholar 

  3. Stöhr T, Märtens H, Rahm E. Multi-dimensional database allocation for parallel data warehouses. In: Proceedings of the 26th International Conference on Very Large Data Bases. 2000, 273–284

    MATH  Google Scholar 

  4. Bentley J L. Multidimensional binary search trees used for associative searching. Communications of the ACM, 1975, 18(9): 509–517

    Article  MATH  Google Scholar 

  5. Zhan C, Su M, Wei C, Peng X, Lin L, Wang S, Chen Z, Li F, Pan Y, Zheng F, Chai C. AnalyticDB: real-time OLAP database system at alibaba cloud. Proceedings of the VLDB Endowment, 2019, 12(12): 2059–2070

    Article  Google Scholar 

  6. Papadomanolakis S, Ailamaki A. AutoPart: automating schema design for large scientific databases using data partitioning. In: Proceedings of the 16th International Conference on Scientific and Statistical Database Management. 2004, 383–392

    MATH  Google Scholar 

  7. Sun L, Franklin M J, Krishnan S, Xin R S. Fine-grained partitioning for aggressive data skipping. In: Proceedings of 2014 ACM SIGMOD International Conference on Management of Data. 2014, 1115–1126

    Chapter  MATH  Google Scholar 

  8. Ang Z, Chandramouli B, Wang C, Gehrke J, Li Y, Minhas U F, Larson PÅ, Kossmann D, Acharya R. Qd-tree: learning data layouts for big data analytics. In: Proceedings of 2020 ACM SIGMOD International Conference on Management of Data. 2020, 193–208

    Chapter  Google Scholar 

  9. Li Z, Yiu M L, Chan T N. PAW: data partitioning meets workload variance. In: Proceedings of the 38th IEEE International Conference on Data Engineering. 2022, 123–135

    MATH  Google Scholar 

  10. Sun L, Franklin M J, Wang J, Wu E. Skipping-oriented partitioning for columnar layouts. Proceedings of the VLDB Endowment, 2016, 10(4): 421–432

    Article  MATH  Google Scholar 

  11. Li C, Markl V, Aly A M, Mahmood A R, Hassan M S, Aref W G, Ouzzani M, Elmeleegy H, Qadah T. AQWA: adaptive query workload aware partitioning of big spatial data. Proceedings of the VLDB Endowment, 2015, 8(13): 2062–2073

    Article  Google Scholar 

  12. Aly A M, Elmeleegy H, Qi Y, Aref W. Kangaroo: workload-aware processing of range data and range queries in hadoop. In: Proceedings of the 9th ACM International Conference on Web Search and Data Mining. 2016, 397–406

    MATH  Google Scholar 

  13. Lu Y, Shanbhag A, Jindal A, Madden S. AdaptDB: adaptive partitioning for distributed joins. Proceedings of the VLDB Endowment, 2017, 10(5): 589–600

    Article  MATH  Google Scholar 

  14. Ding J, Minhas U F, Chandramouli B, Wang C, Li Y, Li Y, Kossmann D, Gehrke J, Kraska T. Instance-optimized data layouts for cloud analytics workloads. In: Proceedings of 2021 International Conference on Management of Data. 2021, 418–431

    Chapter  MATH  Google Scholar 

  15. TPC-H benchmark. See tpc.org/tpch/ website, 1999.

  16. Rosenblatt F. Principles of neurodynamics: perceptrons and the theory of brain mechanisms. The American Journal of Psychology, 1963, 76(4): 705–707

    Article  MATH  Google Scholar 

  17. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9(8): 1735–1780

    Article  MATH  Google Scholar 

  18. Shvachko K, Kuang H, Radia S, Chansler R. The hadoop distributed file system. In: Proceedings of the 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST). 2010, 1–10

    Google Scholar 

  19. Shanbhag A, Jindal A, Madden S, Quiane J, Elmore A J. A robust partitioning scheme for ad-hoc query workloads. In: Proceedings of 2017 Symposium on Cloud Computing. 2017, 229–241

    Chapter  MATH  Google Scholar 

  20. Huang D, Liu Q, Cui Q, Fang Z, Ma X, Xu F, Shen L, Tang L, Zhou Y, Huang M, Wei W, Liu C, Zhang J, Li J, Wu X, Song L, Sun R, Yu S, Zhao L, Cameron N, Pei L, Tang X. TIDB: a raft-based HTAP database. Proceedings of the VLDB Endowment, 2020, 13(12): 3072–3084

    Article  Google Scholar 

  21. ClickHouse: an open-source columnar database management system. See clickhouse.com/docs/en/observability/managing-data website, 2016

  22. Dageville B, Cruanes T, Zukowski M, Antonov V, Avanes A, Bock J, Claybaugh J, Engovatov D, Hentschel M, Huang J S, Lee A W, Motivala A, Munir A Q, Pelley S, Povinec P, Rahn G, Triantafyllis S, Unterbrunner P. The snowflake elastic data warehouse. In: Proceedings of 2016 International Conference on Management of Data. 2016, 215–226

    Chapter  Google Scholar 

  23. Moerkotte G. Small materialized aggregates: a light weight index structure for data warehousing. In: Proceedings of the 24th International Conference on Very Large Data Bases. 1998, 476–487

    MATH  Google Scholar 

  24. Graefe G. Fast loads and fast queries. In: Proceedings of the 11th International Conference on Data Warehousing and Knowledge Discovery. 2009, 111–124

    Chapter  MATH  Google Scholar 

  25. Kang D, Jiang R, Blanas S. Jigsaw: a data storage and query processing engine for irregular table partitioning. In: Proceedings of 2021 International Conference on Management of Data. 2021, 898–911

    Chapter  MATH  Google Scholar 

  26. Han A, Yan X, Tao S, Anerousis N. Workload characterization and pre diction in the cloud: a multiple time series approach. In: Proceedings of 2012 IEEE Network Operations and Management Symposium. 2012, 1287–1294

    MATH  Google Scholar 

  27. Pavlo A, Angulo G, Arulraj J, Lin H, Lin J, Ma L, Menon P, Mowry T C, Perron M, Quah I, Santurkar S, Tomasic A, Toor S, Van Aken D, Wang Z, Wu Y, Xian R, Zhang T. Self-driving database management systems. In: Proceedings of the 8th Biennial Conference on Innovative Data Systems Research. 2017, 1

    Google Scholar 

  28. Ma L, Van Aken D, Hefny A, Mezerhane G, Pavlo A, Gordon G J. Query-based workload forecasting for self-driving database management systems. In: Proceedings of 2018 International Conference on Management of Data. 2018, 631–645

    Chapter  Google Scholar 

  29. Hilprecht B, Binnig C, Röhm U. Learning a partitioning advisor for cloud databases. In: Proceedings of 2020 ACM SIGMOD International Conference on Management of Data. 2020, 143–157

    Chapter  MATH  Google Scholar 

  30. Zhou X, Li G, Feng J, Liu L, Guo W. Grep: a graph learning based database partitioning system. Proceedings of the ACM on Management of Data, 2023, 1(1): 94

    Article  MATH  Google Scholar 

  31. Jindal A, Dittrich J. Relax and let the database do the partitioning online. In: Proceedings of the 5th International Workshop on Business Intelligence for the Real-Time Enterprise. 2011, 65–80

    MATH  Google Scholar 

  32. Wang J, Chai C, Liu J, Li G. Face: a normalizing flow based cardinality estimator. Proceedings of the VLDB Endowment, 2021, 15(1): 72–84

    Article  MATH  Google Scholar 

  33. Hoerl A E, Kennard R W. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 1970, 12(1): 55–67

    Article  MATH  Google Scholar 

  34. Bertino E, Atzeni P, Tan K L, Chen Y, Tay Y C, Melnik S, Gubarev A, Long J J, Romer G, Shivakumar S, Tolton M, Vassilakis T. Dremel: interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 2010, 3(1–2): 330–339

    Google Scholar 

  35. Ray: an open source framework to build and scale your ML and Python applications. See docs.ray.io/en/latest/ website, 2017

  36. TPC-DS benchmark. See www.tpc.org/tpcds/ website, 2005

  37. JOB benchmark. See developer.imdb.com/non-commercial-datasets/ website, 2016

  38. ClickBench benchmark. See github.com/ClickHouse/ClickBench website, 2019

Download references

Acknowledgements

This work was supported by the National Key Research and Development Program of China (Grant No. 2023YFB4503600) and the National Natural Science Foundation of China (Grant Nos. U23A20299, 62072460, 62172424, 62276270, and 62322214).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Cuiping Li.

Ethics declarations

Competing interests The authors declare that they have no competing interests or financial conflicts to disclose.

Additional information

Pengju Liu is currently pursuing his PhD degree at the School of Information, Renmin University of China, China. He received his B.S. degree in information management and information system from Dalian Maritime University, China, in 2020. His research interests include adaptable database partitioning, and load forecasting.

Pan Cai is a PhD candidate at School of Information, Renmin University of China, China. Her research interests include query optimization, learned index, and machine learning. She is particularly interested in the structural design of learned indexes on multidimensional data.

Kai Zhong is a PhD student at School of Information, Renmin University of China, China, advised by Professor Cuiping Li. He received his BS degree in Information and Computing Science at School of Information, Huazhong Agricultural University, China in June 2022. His research lie in the field of AI4DB and Machine Learning.

Cuiping Li is currently a professor at the Renmin University of China, China. She received her PhD degree from Chinese Academy of Sciences, China in 2003. Prior to that, she earned her BS and MS degrees from Xi’an Jiaotong University, China in 1994 and 1997, respectively. She is a distinguished member of CCF. Her main research interests are machine learning for database and distributed query optimization.

Hong Chen is currently a professor at the Renmin University of China, China. She received her PhD degree from Chinese Academy of Sciences, China in 2000. Before that, she received her BS and MS degrees from Renmin University of China, China in 1986 and 1989, respectively. She received Second Prize of the National Award for Science and Technology Progress in 2018. She is a committee member of CCF and CIC. Her research interests include database technology and high-performance computing.

Electronic supplementary material

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Liu, P., Cai, P., Zhong, K. et al. LRP: learned robust data partitioning for efficient processing of large dynamic queries. Front. Comput. Sci. 19, 199607 (2025). https://doi.org/10.1007/s11704-024-40509-4

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s11704-024-40509-4

Keywords