
PowerHash: a hybrid grouping scheme by leveraging power-law properties of data

  • Regular Paper
  • International Journal of Data Science and Analytics

Abstract

We study implementation schemes for GroupBy, an operation widely used in distributed systems and databases that partitions a set of unordered records into groups. Because of the massive data sizes involved, many I/O-efficient grouping schemes that exploit external memory have been proposed. In this paper, we observe that the group sizes of many real datasets follow a power law, and that the performance of a grouping scheme varies considerably with group size: the indexing–filling approach favors data with large groups, while the partitioned hash approach favors data with small groups. Based on this observation, we propose PowerHash, a hybrid approach that invokes different grouping schemes for different data. Group sizes are estimated approximately with a count-min sketch, so that large groups can be distinguished from small ones. Under a given memory budget, our results show that PowerHash improves performance by up to six times over existing GroupBy implementations.
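The routing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `CountMinSketch`, the function `split_by_group_size`, the `threshold` parameter, the sketch dimensions, the salted BLAKE2b hashing, and the two-pass structure are all assumptions made for illustration. The sketch only separates records into estimated large and small groups; the indexing–filling and partitioned hash schemes themselves are not shown.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency summary. Estimates may overcount, never undercount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _slots(self, key):
        # One salted hash per row (hash choice is an assumption of this sketch).
        data = repr(key).encode()
        for row in range(self.depth):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key):
        for row, col in self._slots(key):
            self.table[row][col] += 1

    def estimate(self, key):
        # The minimum over rows is the least-contaminated counter.
        return min(self.table[row][col] for row, col in self._slots(key))


def split_by_group_size(records, key_of, threshold, sketch=None):
    """Return (big, small): records whose estimated group size is >= threshold,
    and the rest. Hypothetical helper; uses two passes over the records."""
    records = list(records)
    if sketch is None:
        sketch = CountMinSketch()
    # Pass 1: summarize group sizes in bounded memory.
    for record in records:
        sketch.add(key_of(record))
    # Pass 2: route each record by its estimated group size.
    big, small = [], []
    for record in records:
        target = big if sketch.estimate(key_of(record)) >= threshold else small
        target.append(record)
    return big, small


# One large group ("a") and two singleton groups ("b", "c").
records = [("a", i) for i in range(100)] + [("b", 0), ("c", 1)]
big, small = split_by_group_size(records, key_of=lambda r: r[0], threshold=10)
```

Following the abstract, `big` would then go to a scheme suited to large groups (indexing–filling) and `small` to one suited to small groups (partitioned hash). Because a count-min sketch can only overestimate, a truly large group is never misrouted as small, while a small group may occasionally be treated as large when hash collisions inflate its count.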



Notes

  1. http://snap.stanford.edu/data/higgs-twitter.html.

  2. http://snap.stanford.edu/data/web-BerkStan.html.

  3. http://snap.stanford.edu/data/web-Google.html.


Acknowledgements

This work was partially supported by the National Key R&D Program of China (2018YFB1003404), the National Natural Science Foundation of China (61672141), and the Fundamental Research Funds for the Central Universities (N181605017).

Author information

Corresponding author

Correspondence to Yanfeng Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Wei, X., Kong, X., Zhang, Y. et al. PowerHash: a hybrid grouping scheme by leveraging power-law properties of data. Int J Data Sci Anal 9, 273–284 (2020). https://doi.org/10.1007/s41060-019-00192-2


  • DOI: https://doi.org/10.1007/s41060-019-00192-2
