A Real-Time Partition Generation Mechanism for Data Skew Mitigation in Spark Computing Environment

Yang, Li; Xiao, Xiong; Zhang, Xuedong; Hu, Zhechang; Tang, Zhuo

doi:10.1007/s10723-023-09700-y

Research
Published: 31 October 2023

Journal of Grid Computing, Volume 21, article number 62 (2023)

Li Yang¹, Xiong Xiao¹,², Xuedong Zhang³, Zhechang Hu¹,² & Zhuo Tang¹,²

Abstract

Because the computing power of a single node is limited, big data is usually processed on a distributed parallel processing framework. Real-world data, however, is rarely evenly distributed, and data skew can seriously degrade the performance of distributed parallel computing: some tasks are overloaded while other computing resources sit idle. To address these problems, we first propose an optimization method based on step-size sampling, which predicts the distribution of intermediate data more accurately. We then propose a balanced partitioning strategy based on adaptively adjusting computational granularity (BPAG). The granularity adjustment takes into account the characteristics of the sampled data and the usage of computing resources. The balanced partitioning strategy distinguishes keys with different weights through weighted round-robin and efficient hashing, yielding a partitioning strategy based on high-weight keys (HWKP) and one based on low-weight keys (LWKP). Finally, we implement BPAG on Spark 2.4.8 and compare it against five related approaches on four widely used big data benchmarks. The evaluation results show that BPAG effectively balances partitions and reduces task execution time.
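The general idea of sampling keys and then treating heavy and light keys differently can be illustrated with a small sketch. This is not the BPAG, HWKP, or LWKP algorithm from the article, whose details are not given in the abstract. It is a minimal illustration under stated assumptions: fixed-step (systematic) sampling estimates key frequencies, a hypothetical threshold `heavy_fraction` decides which keys count as heavy, greedy least-loaded placement stands in for weighted round-robin, and CRC32 stands in for the efficient hash. All function and parameter names are invented for this example.

```python
import zlib
from collections import Counter


def step_sample(records, step):
    """Keep every `step`-th record (systematic sampling)."""
    return records[::step]


def build_partition_plan(sample_keys, num_partitions, heavy_fraction=0.1):
    """Pin heavy keys to explicit partitions; light keys are left to hashing.

    Assumption: a key is "heavy" when its share of the sample exceeds
    `heavy_fraction`. This rule is hypothetical, not taken from the article.
    """
    counts = Counter(sample_keys)
    total = sum(counts.values())
    loads = [0] * num_partitions
    plan = {}
    # most_common() yields keys in descending frequency, so we can stop
    # at the first key that falls below the threshold.
    for key, count in counts.most_common():
        if count / total <= heavy_fraction:
            break
        # Greedy stand-in for weighted round-robin: pick the partition
        # with the smallest estimated load so far.
        target = loads.index(min(loads))
        plan[key] = target
        loads[target] += count
    return plan


def partition_for(key, plan, num_partitions):
    """Return the pinned partition for a heavy key, else a stable hash."""
    if key in plan:
        return plan[key]
    # CRC32 is deterministic across runs, unlike Python's built-in hash().
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


if __name__ == "__main__":
    records = ["a"] * 50 + ["b"] * 30 + [f"k{i}" for i in range(20)]
    plan = build_partition_plan(step_sample(records, 2), num_partitions=3)
    print(plan)
```

In this sketch the two most frequent keys are pinned to different partitions, while the remaining keys are spread by hashing. In a Spark job the same lookup could be wrapped in a custom partitioner, but that wiring is omitted here.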

Availability of data and materials

All data generated or analysed during this study are included in this published article.

Funding

This work is supported by the National Key Research and Development Program of China (2021ZD40303), the National Natural Science Foundation of China (Grant Nos. 62225205, 92055213, 62302157), the Natural Science Foundation of Hunan Province of China (2021JJ10023), the Shenzhen Basic Research Project (Natural Science Foundation) (JCYJ20210324140002006), the Hunan Provincial Natural Science Foundation of China (Grant 2021JJ40612), and the Natural Science Foundation of Changsha of China (kq2208042).

Author information

Authors and Affiliations

Hunan University, Changsha, 410082, China
Li Yang, Xiong Xiao, Zhechang Hu & Zhuo Tang
National Supercomputing Center in Changsha, Changsha, 410082, China
Xiong Xiao, Zhechang Hu & Zhuo Tang
Beijing Institute of Space Mechanics & Electricity, Beijing, 100094, China
Xuedong Zhang

Contributions

All authors contributed to the design and implementation of the research, to the analysis of the results and to the writing of the manuscript.

Corresponding author

Correspondence to Xiong Xiao.

Ethics declarations

Conflicts of interest

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yang, L., Xiao, X., Zhang, X. et al. A Real-Time Partition Generation Mechanism for Data Skew Mitigation in Spark Computing Environment. J Grid Computing 21, 62 (2023). https://doi.org/10.1007/s10723-023-09700-y

Received: 10 October 2022
Accepted: 10 October 2023
Published: 31 October 2023
DOI: https://doi.org/10.1007/s10723-023-09700-y
