Abstract
Today, an ever-increasing number of workloads is pushed to and executed on the Cloud. To serve and manage this computational demand effectively, data center operators and Cloud providers have embraced application co-location and multi-tenancy as first-class system design concerns. In addition, continuous advancements in hardware technology have made it possible to seamlessly leverage heterogeneous pools of physical machines in data center environments. Even though modern Cloud schedulers and orchestrators adopt application-aware policies to automate time-consuming management tasks at scale, e.g., resource provisioning, they still rely on coarse-grained system metrics, such as CPU and/or memory utilization, to place incoming applications. They therefore do not consider (1) the interference effects provoked by co-located tasks, and (2) the performance impact of the diverse characteristics of heterogeneous systems. Lacking this knowledge, existing state-of-the-art orchestration solutions cannot perform efficient allocations, which negatively impacts the overall latency distribution delivered by the infrastructure. In this paper, to alleviate this inefficiency, we present a machine learning (ML) based Cloud orchestration extension that takes into account both resource interference and heterogeneity. The framework appropriately schedules data-analytics applications on a pool of heterogeneous resources. We evaluate the proposed solution on different application mixes and co-location scenarios, and show that it improves the tail latency of the deployed applications by up to 3.6x compared to the state-of-the-art Kubernetes scheduler.
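To make the contrast in the abstract concrete, the following minimal Python sketch compares a placement driven only by CPU utilization with one driven by a predicted slowdown that accounts for co-location interference and machine heterogeneity. It is an illustration under stated assumptions, not the paper's implementation: the `Node` fields, the `sensitivity` parameter, the node names and values, and the closed-form `predict_slowdown` (a stand-in for a trained ML model) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_util: float      # coarse-grained metric: fraction of CPU in use (0..1)
    speed_factor: float  # hypothetical: slowdown of this machine type vs. the fastest (1.0 = fastest)
    interference: float  # hypothetical: contention pressure from co-located tasks (0..1)


def predict_slowdown(node: Node, sensitivity: float) -> float:
    """Stand-in for a trained ML model (assumption, not the paper's model).

    Returns the expected slowdown of an application on `node`, combining
    machine heterogeneity with the application's sensitivity to interference.
    """
    return node.speed_factor * (1.0 + sensitivity * node.interference)


def place_by_utilization(nodes: list[Node]) -> Node:
    """Baseline: pick the node with the lowest CPU utilization."""
    return min(nodes, key=lambda n: n.cpu_util)


def place_by_prediction(nodes: list[Node], sensitivity: float) -> Node:
    """Pick the node with the lowest predicted slowdown."""
    return min(nodes, key=lambda n: predict_slowdown(n, sensitivity))


# Illustrative cluster; all values are made up for the example.
nodes = [
    Node("big-busy", cpu_util=0.60, speed_factor=1.0, interference=0.2),
    Node("small-idle", cpu_util=0.10, speed_factor=2.5, interference=0.0),
    Node("big-noisy", cpu_util=0.30, speed_factor=1.0, interference=0.9),
]

if __name__ == "__main__":
    print("utilization-based:", place_by_utilization(nodes).name)
    print("prediction-based: ", place_by_prediction(nodes, sensitivity=1.0).name)
```

In this toy cluster the utilization-based policy selects the idle but slow machine, while the prediction-based policy selects a fast machine with mild contention, which is the kind of decision that coarse-grained metrics alone cannot express.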
Acknowledgements
The research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the 3rd Call for HFRI Ph.D. Fellowships (Fellowship Number: 5349), and it was partially funded by the EU Horizon 2020 research and innovation programme, under project AIatEDGE, grant agreement No. 101015922.
Author information
Contributions
Achilleas Tzenetopoulos: Formal analysis, Conceptualization, Methodology, Software, Investigation, Experiments, Visualization, Writing–original draft. Dimosthenis Masouros: Formal analysis, Conceptualization, Methodology, Investigation, Supervision, Writing–original draft, Writing–review and editing. Sotirios Xydis: Formal analysis, Conceptualization, Methodology, Investigation, Supervision, Writing–review and editing. Dimitrios Soudris: Conceptualization, Supervision, Writing–review.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests, as described by Springer, or personal relationships that might be perceived to influence the results and/or discussion reported in this paper.
About this article
Cite this article
Tzenetopoulos, A., Masouros, D., Xydis, S. et al. Orchestration Extensions for Interference- and Heterogeneity-Aware Placement for Data-Analytics. Int J Parallel Prog 52, 298–323 (2024). https://doi.org/10.1007/s10766-024-00771-2
DOI: https://doi.org/10.1007/s10766-024-00771-2