Efficient Batch Grouping in Relational Datasets

Sun, Jizhou; Li, Jianzhong; Gao, Hong

doi:10.1007/978-3-319-55753-3_24

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10177)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

Data grouping is an expensive and frequently used operator in data processing. Because the data is often too big to fit in memory, disk-sorting-based methods are commonly employed. Disk sorting reads and writes the entire dataset many times, which is very time-consuming, so reducing I/O costs is of great significance. In many applications, a set of records must be grouped multiple times on different keys. This paper studies grouping in a batch manner, together with techniques for sharing intermediate results, to improve efficiency. In the batch grouping setting, different grouping orders may result in different I/O costs. To minimize I/O costs, we formalize group-order scheduling as an optimization problem, which can be proven to be NP-complete, and then propose a heuristic algorithm. Experimental results on TPC-H as well as synthetic datasets show the efficiency and robustness of our techniques.
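
The idea of sharing a sort across several groupings can be illustrated with a small sketch. It is not the heuristic proposed in the paper, whose details are not given in the abstract. It assumes a toy model in which one sort on a column order serves every grouping key whose columns form a prefix of that order, and it uses an in-memory sort as a stand-in for disk sorting. The names plan_sorts and group_counts are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def plan_sorts(keys):
    """Greedy toy planner (an assumption, not the paper's algorithm).

    Covers the grouping keys (sets of columns) with sort orders so that
    each key is a prefix of exactly one order. Fewer orders means fewer
    full sorts of the dataset.
    """
    remaining = sorted({frozenset(k) for k in keys},
                       key=lambda k: (len(k), sorted(k)))
    plans = []
    while remaining:
        chain = [remaining.pop(0)]          # start from the smallest key
        for k in list(remaining):           # extend with strict supersets
            if chain[-1] < k:
                chain.append(k)
                remaining.remove(k)
        order = []
        for k in chain:                     # nested keys become prefixes
            order.extend(sorted(k - set(order)))
        plans.append((tuple(order), chain))
    return plans


def group_counts(rows, order, prefix_len):
    """Count group sizes on the first prefix_len columns of order.

    rows must already be sorted by order, so equal keys are adjacent.
    """
    key = itemgetter(*order[:prefix_len])
    return {k: sum(1 for _ in g) for k, g in groupby(rows, key=key)}


# Four grouping keys are served by two sort orders instead of four sorts.
plans = plan_sorts([("a",), ("a", "b"), ("a", "b", "c"), ("c",)])
orders = [order for order, _ in plans]

rows = [
    {"a": 1, "b": 2, "c": 0},
    {"a": 1, "b": 1, "c": 0},
    {"a": 2, "b": 1, "c": 5},
    {"a": 1, "b": 2, "c": 3},
]
rows.sort(key=itemgetter("a", "b", "c"))    # one shared sort
by_a = group_counts(rows, ("a", "b", "c"), 1)
by_ab = group_counts(rows, ("a", "b", "c"), 2)
```

In this sketch, a single sort on (a, b, c) answers the groupings on a, on (a, b), and on (a, b, c), while the grouping on c alone still needs its own sort. Which keys are chained together determines how many sorts are needed, which is the kind of ordering decision the scheduling problem in the abstract is concerned with.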

Notes

1. The TPC-H specifications can be found in www.tpc.org/tpc_documents_current_versions/current_specifications.asp.
2. All proofs of theorems in this paper can be found in the technical report: www.researchgate.net/publication/312070129.

Acknowledgments

This work is supported in part by the Key Research and Development Plan of the National Ministry of Science and Technology under Grant No. 2016YFB1000703, and by the Key Program of the National Natural Science Foundation of China under Grant Nos. 61190115, 61632010, and U1509216.

Author information

Authors and Affiliations

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
Jizhou Sun, Jianzhong Li & Hong Gao

Corresponding author

Correspondence to Jizhou Sun.

Editor information

Editors and Affiliations

Arizona State University, Tempe - Phoenix, Arizona, USA
Selçuk Candan
Hong Kong University of Science and Tech, Hong Kong, China
Lei Chen
Aalborg University, Aalborg, Denmark
Torben Bach Pedersen
University of New South Wales, Sydney, New South Wales, Australia
Lijun Chang
The University of Queensland, Brisbane, Queensland, Australia
Wen Hua

About this paper

Cite this paper

Sun, J., Li, J., Gao, H. (2017). Efficient Batch Grouping in Relational Datasets. In: Candan, S., Chen, L., Pedersen, T., Chang, L., Hua, W. (eds) Database Systems for Advanced Applications. DASFAA 2017. Lecture Notes in Computer Science, vol 10177. Springer, Cham. https://doi.org/10.1007/978-3-319-55753-3_24

DOI: https://doi.org/10.1007/978-3-319-55753-3_24
Published: 22 March 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-55752-6
Online ISBN: 978-3-319-55753-3
eBook Packages: Computer Science, Computer Science (R0)
