Efficient Batch Grouping in Relational Datasets

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2017)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10177)

Abstract

Data grouping is an expensive and frequently used operator in data processing. Because data is often too big to fit in memory, disk-based sorting is commonly employed. Disk sorting reads and writes the entire dataset many times, which is very time-consuming, so reducing I/O costs is of great significance. In many applications, it is common to group the same set of records multiple times on different keys. This paper studies grouping in a batch manner, together with techniques for sharing intermediate results, to improve efficiency. In batch grouping settings, different grouping orders may result in different I/O costs. To minimize I/O costs, we formalize the group-order scheduling problem as an optimization problem, which can be proven to be NP-complete, and then propose a heuristic algorithm. Experimental results on TPC-H as well as synthetic datasets show the efficiency and robustness of our techniques.
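
As a rough illustration of why sharing intermediate results can pay off, the following minimal Python sketch sorts a small in-memory table once on a composite key and then answers two grouping requests from that single sorted pass. It is not the algorithm proposed in the paper: the table `rows`, the helper `group_counts`, and the choice of a row count as the aggregate are all assumptions made for this example, and a real disk-based implementation would use external sorting rather than `sorted`.

```python
from itertools import groupby
from operator import itemgetter


def group_counts(sorted_rows, key_cols):
    """Count rows per group in one sequential pass.

    Assumes sorted_rows is already ordered so that rows with equal
    values on key_cols are adjacent (hypothetical helper).
    """
    key = itemgetter(*key_cols)
    return {k: sum(1 for _ in grp) for k, grp in groupby(sorted_rows, key=key)}


# Hypothetical toy table with columns (a, b, value).
rows = [
    ("x", 2, 10),
    ("y", 1, 20),
    ("x", 1, 30),
    ("x", 2, 40),
    ("y", 1, 50),
]

# One sort on the composite key (a, b) ...
shared = sorted(rows, key=itemgetter(0, 1))

# ... serves both a grouping on (a) and a grouping on (a, b),
# because (a) is a prefix of the sort key.
by_a = group_counts(shared, (0,))
by_ab = group_counts(shared, (0, 1))

print(by_a)
print(by_ab)
```

A grouping on `b` alone could not reuse this order, since `b` is not a prefix of the sort key `(a, b)`. Which sort orders to materialize, and in what sequence, is the kind of choice that the group-order scheduling problem described in the abstract is concerned with.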

Notes

  1. The TPC-H specifications can be found at www.tpc.org/tpc_documents_current_versions/current_specifications.asp.

  2. All proofs of theorems in this paper can be found in the technical report: www.researchgate.net/publication/312070129.

Acknowledgments

This work is supported in part by the Key Research and Development Plan of the National Ministry of Science and Technology under Grant No. 2016YFB1000703, and by the Key Program of the National Natural Science Foundation of China under Grant Nos. 61190115, 61632010 and U1509216.

Author information

Corresponding author

Correspondence to Jizhou Sun.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Sun, J., Li, J., Gao, H. (2017). Efficient Batch Grouping in Relational Datasets. In: Candan, S., Chen, L., Pedersen, T., Chang, L., Hua, W. (eds) Database Systems for Advanced Applications. DASFAA 2017. Lecture Notes in Computer Science, vol 10177. Springer, Cham. https://doi.org/10.1007/978-3-319-55753-3_24

  • DOI: https://doi.org/10.1007/978-3-319-55753-3_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-55752-6

  • Online ISBN: 978-3-319-55753-3

  • eBook Packages: Computer Science, Computer Science (R0)
