Abstract
With the popularization of big data technology, distributed computing systems continue to evolve and mature, contributing substantially to the query and analysis of massive data. However, insufficient utilization of system resources is an inherent problem of distributed computing engines. In particular, when many submitted jobs block one another, the system schedules them on a first-come-first-executed (FCFE) basis, even if many resources in the cluster remain unused. Optimizing resource utilization is therefore key to improving the efficiency of multi-job execution. We investigated multi-job execution optimization, designed a multi-job merging framework and a scheduling optimization algorithm, and implemented them in the latest generation of distributed computing systems, Apache Flink. The advantages of our work are as follows: (1) the framework enables Flink to support multi-job collection, merging, and dynamic tuning of the execution sequence, and the selection of these functions is customizable; (2) with multi-job merging and optimization, the total running time can be reduced by 31% compared with traditional sequential execution; (3) the multi-job scheduling optimization algorithm can bring a 28% performance improvement and, in the average case, can reduce idle cluster resources by 61%.
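The FCFE problem described in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's algorithm and does not use any Flink API. It assumes a hypothetical cluster with a fixed number of slots and jobs given as invented (slots, duration) pairs that are all queued at time 0. It compares strict FCFE, where a job that does not fit blocks every job behind it, with a simple policy that lets later jobs start whenever they fit.

```python
import heapq


def simulate(jobs, total_slots, allow_backfill):
    """Return (makespan, idle_slot_time) for a queue of (slots, duration) jobs.

    Toy model (an assumption for illustration): all jobs are queued at time 0
    in list order, and each job holds its slots for its whole duration.
    """
    if any(slots > total_slots for slots, _ in jobs):
        raise ValueError("a job requests more slots than the cluster has")

    waiting = list(jobs)
    running = []              # min-heap of (finish_time, slots)
    free = total_slots
    now = 0
    busy_slot_time = 0

    while waiting or running:
        # Start every job that the policy allows at the current time.
        i = 0
        while i < len(waiting):
            slots, duration = waiting[i]
            if slots <= free:
                heapq.heappush(running, (now + duration, slots))
                free -= slots
                busy_slot_time += slots * duration
                waiting.pop(i)
            elif allow_backfill:
                i += 1        # skip the blocked job and try later ones
            else:
                break         # strict FCFE: the blocked job stalls the queue

        # Advance to the next completion time and release all slots freed then.
        now = running[0][0]
        while running and running[0][0] == now:
            free += heapq.heappop(running)[1]

    return now, total_slots * now - busy_slot_time


if __name__ == "__main__":
    demo_jobs = [(6, 4), (6, 4), (2, 4), (2, 4)]   # invented workload
    print("strict FCFE:", simulate(demo_jobs, 8, allow_backfill=False))
    print("fit-first  :", simulate(demo_jobs, 8, allow_backfill=True))
```

For this invented workload on 8 slots, strict FCFE finishes at time 12 and leaves 32 slot-time units idle, while the fit-first policy finishes at time 8 with no idle slot time. A policy like this can starve large jobs, which is one reason a real scheduler needs a more careful ordering strategy.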
Acknowledgments
This research was supported by the National Key R&D Program of China under Grant No. 2018YFB1004402, the NSFC under Grant Nos. 61872072, 61772124, 61932004, 61732003, and 61729201, and the Fundamental Research Funds for the Central Universities under Grant Nos. N2016009 and N181605012.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Ji, H., Wu, G., Zhao, Y., Yuan, Y., Wang, G. (2021). Multi-job Merging Framework and Scheduling Optimization for Apache Flink. In: Jensen, C.S., et al. Database Systems for Advanced Applications. DASFAA 2021. Lecture Notes in Computer Science, vol 12681. Springer, Cham. https://doi.org/10.1007/978-3-030-73194-6_2
DOI: https://doi.org/10.1007/978-3-030-73194-6_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-73193-9
Online ISBN: 978-3-030-73194-6
eBook Packages: Computer Science, Computer Science (R0)