Abstract
Spark Streaming is one of the mainstream stream processing frameworks; it processes real-time stream data using a micro-batch approach. However, its default task scheduling process has shortcomings, such as a high cluster usage cost caused by an inappropriate executor placement strategy in heterogeneous cluster environments. Moreover, most current scheduling studies focus on improving the processing performance of clusters while ignoring their cost efficiency and service quality assurance. In this paper, we propose a low-cost executor placement method for heterogeneous clusters, based on resource demand prediction using machine learning, which is called Cost-Efficient and Best-Fit Decrease (CEBFD). First, a cost-efficient model is constructed for the Spark Streaming framework. Then, the Sparrow Search Algorithm (SSA) and the eXtreme Gradient Boosting (XGBoost) algorithm are combined to predict the resources required by streaming tasks. Finally, an executor placement method for heterogeneous Spark Streaming clusters is designed based on the cost-efficient model and the resource demand prediction. Furthermore, the proposed method improves Service Level Agreement (SLA) satisfaction in terms of cost minimization and job deadline guarantees for stream processing. Experimental results show that the proposed approach reduces the cluster usage cost by 6.89% to 52.24% and effectively optimizes the SLA compared with existing algorithms.
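To make the placement idea concrete, the following is a minimal sketch of a generic cost-aware best-fit-decreasing heuristic. It is not the CEBFD algorithm itself, whose details are not given in the abstract. All names (`Node`, `Executor`, `place_executors`, `hourly_cost`) and all prices and capacities are hypothetical, and the executor resource demands are taken as fixed inputs rather than predicted by an SSA-tuned XGBoost model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A worker node in a heterogeneous cluster (all values are illustrative)."""
    name: str
    cores: int            # free CPU cores
    mem_gb: float         # free memory in GB
    cost_per_hour: float  # hypothetical price paid only if the node is used
    executors: list = field(default_factory=list)


@dataclass
class Executor:
    """An executor with an (assumed, already predicted) resource demand."""
    name: str
    cores: int
    mem_gb: float


def place_executors(executors, nodes):
    """Greedy placement: largest executors first, each onto the 'best' node.

    A node that is already in use adds no extra cost, so it is preferred;
    otherwise the cheapest feasible node is opened. Ties are broken by the
    tightest fit (fewest cores left over).
    """
    placement = {}
    ordered = sorted(executors, key=lambda e: (e.cores, e.mem_gb), reverse=True)
    for ex in ordered:
        feasible = [n for n in nodes
                    if n.cores >= ex.cores and n.mem_gb >= ex.mem_gb]
        if not feasible:
            raise ValueError(f"no node can host executor {ex.name}")

        def rank(n):
            leftover = n.cores - ex.cores
            if n.executors:                    # already paid for
                return (0, 0.0, leftover)
            return (1, n.cost_per_hour, leftover)

        best = min(feasible, key=rank)
        best.cores -= ex.cores
        best.mem_gb -= ex.mem_gb
        best.executors.append(ex.name)
        placement[ex.name] = best.name
    return placement


def hourly_cost(nodes):
    """Total hourly price of all nodes that host at least one executor."""
    return sum(n.cost_per_hour for n in nodes if n.executors)


if __name__ == "__main__":
    nodes = [Node("small", 4, 16, 0.20),
             Node("medium", 8, 32, 0.35),
             Node("large", 16, 64, 0.80)]
    executors = [Executor("e1", 4, 8), Executor("e2", 4, 8), Executor("e3", 2, 4)]
    print(place_executors(executors, nodes))
    print(round(hourly_cost(nodes), 2))
```

A greedy heuristic of this kind is fast but does not guarantee a minimum-cost placement in general; it only illustrates how node price and remaining capacity can be combined when choosing where to put each executor.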
Availability of data and materials
The datasets generated during the current study are available from the corresponding author on reasonable request.
Author information
Contributions
Hongjian Li: proposed the idea, conducted the experiments, and wrote the manuscript. Wei Luo: proposed the idea, conducted the experiments, and wrote the manuscript. Wenbin Xie: helped write several sections of the manuscript and proofread it. Huaqing Ye: helped write several sections of the manuscript and proofread it. Xiaolin Duan: helped write several sections of the manuscript and proofread it.
Ethics declarations
Competing interests
None. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
About this article
Cite this article
Li, H., Luo, W., Xie, W. et al. Adaptive Scheduling Framework of Streaming Applications based on Resource Demand Prediction with Hybrid Algorithms. J Grid Computing 22, 39 (2024). https://doi.org/10.1007/s10723-024-09756-4
DOI: https://doi.org/10.1007/s10723-024-09756-4