Abstract
Because the computing power of a single node is limited, big data is usually processed on a distributed parallel processing framework. Real-world data is rarely evenly distributed, and the resulting data skew seriously degrades the performance of distributed parallel computing: some tasks are overloaded while other computing resources sit idle. To address this problem, we first propose an optimization method based on step-size sampling that predicts the distribution of intermediate data more accurately. We then propose a balanced partitioning strategy based on adaptively adjusting computational granularity (BPAG). The granularity adjustment takes into account the characteristics of the sampled data and the usage of computing resources. The balanced partitioning strategy distinguishes keys with different weights through weighted round-robin and efficient hashing. Accordingly, we propose a partitioning strategy based on high-weight keys (HWKP) and a partitioning strategy based on low-weight keys (LWKP). Finally, we implement BPAG on Spark 2.4.8 and compare it experimentally with five related works on four widely used big data benchmarks. The evaluation results show that BPAG effectively achieves partition balance and reduces task execution time.
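The general idea of sampling intermediate keys and then treating heavy and light keys differently can be illustrated with a minimal sketch. It is not the BPAG implementation, and the abstract does not give the algorithmic details. The function names `step_sample` and `build_partitioner`, the `heavy_fraction` threshold, the greedy least-loaded placement (used here in place of weighted round-robin), and the CRC32 hash are all assumptions made for illustration.

```python
import heapq
import zlib
from collections import Counter


def step_sample(keys, step):
    """Systematic sampling: keep every `step`-th key, starting at index 0."""
    return keys[::step]


def build_partitioner(sampled_keys, num_partitions, heavy_fraction=0.1):
    """Return a function that maps a key to a partition index.

    Assumption for illustration: a key whose share of the sample is at
    least `heavy_fraction` counts as high-weight. High-weight keys are
    placed greedily on the least-loaded partition; every other key falls
    back to a stable hash.
    """
    counts = Counter(sampled_keys)
    total = sum(counts.values())
    heavy = [(k, c) for k, c in counts.items() if c / total >= heavy_fraction]
    # Place the heaviest keys first; ties are broken by key for determinism.
    heavy.sort(key=lambda kc: (-kc[1], str(kc[0])))

    # Min-heap of (estimated load, partition index).
    loads = [(0, p) for p in range(num_partitions)]
    heapq.heapify(loads)
    assignment = {}
    for key, count in heavy:
        load, p = heapq.heappop(loads)
        assignment[key] = p
        heapq.heappush(loads, (load + count, p))

    def partition(key):
        if key in assignment:
            return assignment[key]
        # CRC32 is a stand-in for whatever hash a real system would use.
        return zlib.crc32(str(key).encode("utf-8")) % num_partitions

    return partition
```

In this sketch the partitioner is built once from the sample and then applied to every record key, so the heavy keys are spread by estimated load while the long tail of light keys costs only one hash per lookup.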
Availability of data and materials
All data generated or analysed during this study are included in this published article.
Funding
The work is supported by the National Key Research and Development Program of China (2021ZD40303), the National Natural Science Foundation of China (Grant Nos. 62225205, 92055213, 62302157), Natural Science Foundation of Hunan Province of China (2021JJ10023), Shenzhen Basic Research Project (Natural Science Foundation) (JCYJ20210324140002006), the Hunan Provincial Natural Science Foundation of China (Grant 2021JJ40612), Natural Science Foundation of Changsha of China (kq2208042).
Author information
Contributions
All authors contributed to the design and implementation of the research, to the analysis of the results and to the writing of the manuscript.
Ethics declarations
Conflicts of interest
The authors declare that they have no competing interests.
About this article
Cite this article
Yang, L., Xiao, X., Zhang, X. et al. A Real-Time Partition Generation Mechanism for Data Skew Mitigation in Spark Computing Environment. J Grid Computing 21, 62 (2023). https://doi.org/10.1007/s10723-023-09700-y
DOI: https://doi.org/10.1007/s10723-023-09700-y