A Real-Time Partition Generation Mechanism for Data Skew Mitigation in Spark Computing Environment

  • Research
  • Journal of Grid Computing

Abstract

Because the computing power of a single node is limited, big data is usually processed on a distributed parallel processing framework. Real-world data is rarely evenly distributed, and the resulting data skew seriously degrades the performance of distributed parallel computing: some tasks are overloaded while other computing resources sit idle. To address these problems, we first propose an optimization method based on step-size sampling that predicts the distribution of intermediate data more accurately. We then propose a balanced partitioning strategy based on adaptively adjusting computational granularity (BPAG). The granularity adjustment takes into account the characteristics of the sampled data and the usage of computing resources. The balanced partitioning strategy distinguishes keys with different weights through weighted round-robin and efficient hashing, yielding a partitioning strategy for high-weight keys (HWKP) and one for low-weight keys (LWKP). Finally, we implement BPAG on Spark 2.4.8 and compare it with five related works on four widely used big data benchmarks. The evaluation results show that BPAG effectively achieves partition balance and reduces task execution time.
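The general idea of sampling first and then treating heavy and light keys differently can be illustrated with a small sketch. This is not the BPAG implementation: the abstract does not give the algorithmic details, so everything below is an assumption made for illustration. In particular, the function names, the `heavy_threshold` parameter, the greedy least-loaded placement (used here in place of the weighted round-robin named in the abstract), and CRC32 (used here as a stand-in for "efficient hashing") are all hypothetical choices.

```python
from collections import Counter
import zlib


def step_sample(records, step):
    """Systematic sampling: keep every `step`-th record, starting at index 0."""
    return records[::step]


def estimate_weights(sample):
    """Estimate each key's share of the data from the sampled (key, value) pairs."""
    counts = Counter(key for key, _ in sample)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def build_partition_plan(weights, num_partitions, heavy_threshold):
    """Pin each heavy key to the least-loaded partition, heaviest key first.

    Assumption: greedy least-loaded placement, not the paper's weighted
    round-robin, whose details are not given in the abstract.
    """
    load = [0.0] * num_partitions
    plan = {}
    heavy = sorted(
        (k for k, w in weights.items() if w >= heavy_threshold),
        key=lambda k: (-weights[k], str(k)),
    )
    for key in heavy:
        target = min(range(num_partitions), key=lambda p: load[p])
        plan[key] = target
        load[target] += weights[key]
    return plan


def partition_of(key, plan, num_partitions):
    """Heavy keys follow the plan; all other keys fall back to a hash."""
    if key in plan:
        return plan[key]
    # Assumption: CRC32 as a deterministic stand-in for an efficient hash.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


# Skewed toy data: two dominant keys and a tail of rare keys.
records = [("a", 1)] * 60 + [("b", 1)] * 30 + [(f"k{i}", 1) for i in range(10)]
weights = estimate_weights(step_sample(records, step=5))
plan = build_partition_plan(weights, num_partitions=3, heavy_threshold=0.1)
assignments = Counter(partition_of(key, plan, 3) for key, _ in records)
```

In this toy run the two dominant keys are pinned to different partitions, so neither partition receives both, while the rare keys are spread by the hash. A real partitioner would also have to account for resource usage and computation granularity, which the abstract mentions but this sketch ignores.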

Availability of data and materials

All data generated or analysed during this study are included in this published article.

Funding

This work is supported by the National Key Research and Development Program of China (2021ZD40303), the National Natural Science Foundation of China (Grant Nos. 62225205, 92055213, 62302157), the Natural Science Foundation of Hunan Province of China (2021JJ10023), the Shenzhen Basic Research Project (Natural Science Foundation) (JCYJ20210324140002006), the Hunan Provincial Natural Science Foundation of China (Grant 2021JJ40612), and the Natural Science Foundation of Changsha of China (kq2208042).

Author information

Contributions

All authors contributed to the design and implementation of the research, to the analysis of the results and to the writing of the manuscript.

Corresponding author

Correspondence to Xiong Xiao.

Ethics declarations

Conflicts of interest

The authors declare that they have no competing interests.

About this article

Cite this article

Yang, L., Xiao, X., Zhang, X. et al. A Real-Time Partition Generation Mechanism for Data Skew Mitigation in Spark Computing Environment. J Grid Computing 21, 62 (2023). https://doi.org/10.1007/s10723-023-09700-y

  • DOI: https://doi.org/10.1007/s10723-023-09700-y
