Abstract
Data grouping is an expensive and frequently used operator in data processing. When the data is too big to fit in memory, disk-sorting-based methods are often employed. Disk sorting reads and writes the entire dataset many times, which is very time-consuming, so reducing I/O costs is of great significance. In many applications, it is common to group a set of records multiple times on different keys. This paper studies grouping in a batch manner and techniques for sharing intermediate results to improve efficiency. In batch grouping settings, different grouping orders may result in different I/O costs. To minimize I/O costs, we formalize the group-order scheduling problem as an optimization problem, which can be proven to be NP-complete, and then propose a heuristic algorithm. Experimental results on TPC-H as well as synthetic datasets show the efficiency and robustness of our techniques.
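As a rough illustration of why sharing intermediate results between groupings can save work, the following minimal in-memory sketch sorts a small record set once on a composite key (A, B) and then answers two groupings, on A and on (A, B), each with a single sequential pass. It is not the paper's algorithm and does no disk I/O; the helper `group_sizes` and the sample `records` are hypothetical names introduced only for this example.

```python
from itertools import groupby
from operator import itemgetter


def group_sizes(sorted_records, key_indexes):
    """Count group sizes in one sequential pass.

    Assumes the records are already sorted so that rows sharing the
    requested key are contiguous (true for any prefix of the sort key).
    """
    key = itemgetter(*key_indexes)
    return {k: sum(1 for _ in grp) for k, grp in groupby(sorted_records, key=key)}


# Hypothetical records with columns (A, B, value).
records = [
    ("x", 1, 10),
    ("y", 2, 20),
    ("x", 2, 30),
    ("y", 2, 40),
    ("x", 1, 50),
]

# One sort on the composite key (A, B) ...
by_ab = sorted(records, key=itemgetter(0, 1))

# ... serves both groupings, because A is a prefix of (A, B).
groups_a = group_sizes(by_ab, (0,))
groups_ab = group_sizes(by_ab, (0, 1))

print(groups_a)
print(groups_ab)
```

A grouping on B alone could not reuse this order in general, since rows with equal B are not guaranteed to be contiguous after sorting on (A, B). This is one simple reason the order in which a batch of groupings is processed can change the total cost.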
Notes
1. The TPC-H specifications can be found at www.tpc.org/tpc_documents_current_versions/current_specifications.asp.
2. All proofs of theorems in this paper can be found in the technical report at www.researchgate.net/publication/312070129.
Acknowledgments
This work is supported in part by the Key Research and Development Plan of the National Ministry of Science and Technology under Grant No. 2016YFB1000703, and the Key Program of the National Natural Science Foundation of China under Grant Nos. 61190115, 61632010 and U1509216.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Sun, J., Li, J., Gao, H. (2017). Efficient Batch Grouping in Relational Datasets. In: Candan, S., Chen, L., Pedersen, T., Chang, L., Hua, W. (eds) Database Systems for Advanced Applications. DASFAA 2017. Lecture Notes in Computer Science, vol 10177. Springer, Cham. https://doi.org/10.1007/978-3-319-55753-3_24
DOI: https://doi.org/10.1007/978-3-319-55753-3_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-55752-6
Online ISBN: 978-3-319-55753-3
eBook Packages: Computer Science, Computer Science (R0)