Multi-job Merging Framework and Scheduling Optimization for Apache Flink

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2021)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12681)

Abstract

With the popularization of big data technology, distributed computing systems are constantly evolving and maturing, making substantial contributions to the query and analysis of massive data. However, insufficient utilization of system resources is an inherent problem of distributed computing engines. In particular, when submitted jobs queue up and block one another, the system schedules them on a first-come-first-executed (FCFE) basis, even when many resources in the cluster remain idle. Optimizing resource utilization is therefore key to improving the efficiency of multi-job execution. We investigated multi-job execution optimization, designed a multi-job merging framework and a scheduling optimization algorithm, and implemented them in Apache Flink, a latest-generation distributed computing system. The advantages of our work are as follows: (1) the framework enables Flink to support multi-job collection, merging, and dynamic tuning of the execution sequence, and the selection of these functions is customizable; (2) with multi-job merging and optimization, the total running time can be reduced by 31% compared with traditional sequential execution; (3) the multi-job scheduling optimization algorithm can bring a 28% performance improvement and, in the average case, can reduce idle cluster resources by 61%.
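The resource-utilization problem described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' framework or algorithm and does not use any Flink API; the `Job` type, the slot counts, and the greedy "start whatever fits" policy are all assumptions chosen for illustration. It compares strictly sequential execution with a simple concurrent policy and reports the makespan and the idle slot-time of each.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    # Hypothetical job model: not a Flink type.
    name: str
    slots: int      # task slots the job occupies while it runs
    duration: int   # run time in arbitrary time units


def sequential(jobs, total_slots):
    """Run jobs one at a time in arrival order.

    Returns (makespan, idle slot-time)."""
    makespan = sum(j.duration for j in jobs)
    idle = sum((total_slots - j.slots) * j.duration for j in jobs)
    return makespan, idle


def greedy_concurrent(jobs, total_slots):
    """Start every waiting job that fits in the free slots, in arrival order.

    Returns (makespan, idle slot-time)."""
    waiting = list(jobs)
    running = []  # min-heap of (finish_time, slots)
    now, free, busy = 0, total_slots, 0
    while waiting or running:
        still_waiting = []
        for j in waiting:
            if j.slots <= free:
                free -= j.slots
                heapq.heappush(running, (now + j.duration, j.slots))
                busy += j.slots * j.duration
            else:
                still_waiting.append(j)
        waiting = still_waiting
        if not running:
            raise ValueError("a job needs more slots than the cluster has")
        # Advance the clock to the next job completion and free its slots.
        now, released = heapq.heappop(running)
        free += released
    return now, total_slots * now - busy


# Toy workload on an assumed 8-slot cluster.
jobs = [Job("A", 4, 10), Job("B", 4, 6), Job("C", 2, 8), Job("D", 6, 4)]
print(sequential(jobs, 8))         # (28, 120)
print(greedy_concurrent(jobs, 8))  # (14, 8)
```

The numbers printed here come from the toy workload only and are unrelated to the 31%, 28%, and 61% figures reported in the abstract; the point is simply that running jobs strictly one after another leaves slots unused that a smarter ordering could fill.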

Acknowledgments

This research was supported by the National Key R&D Program of China under Grant No. 2018YFB1004402; the NSFC under Grant Nos. 61872072, 61772124, 61932004, 61732003, and 61729201; and the Fundamental Research Funds for the Central Universities under Grant Nos. N2016009 and N181605012.

Corresponding author

Correspondence to Gang Wu.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Ji, H., Wu, G., Zhao, Y., Yuan, Y., Wang, G. (2021). Multi-job Merging Framework and Scheduling Optimization for Apache Flink. In: Jensen, C.S., et al. Database Systems for Advanced Applications. DASFAA 2021. Lecture Notes in Computer Science, vol 12681. Springer, Cham. https://doi.org/10.1007/978-3-030-73194-6_2

  • DOI: https://doi.org/10.1007/978-3-030-73194-6_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-73193-9

  • Online ISBN: 978-3-030-73194-6

  • eBook Packages: Computer Science, Computer Science (R0)
