Abstract
Data-intensive computing systems in cloud datacenters create long-lived containers and allocate memory resources to them for executing long-running applications. Accurately estimating how much memory should be reserved for a container, so that applications run smoothly while resource utilization stays high, remains a challenge. Current state-of-the-art work has two limitations. First, prediction accuracy is restricted by the monotonicity of the iterative search. Second, application performance fluctuates due to the termination conditions. In this paper, we propose two improved strategies based on MEER, called MEER+ and Deep-MEER, which are designed to assist memory allocation atop resource managers such as YARN. MEER+ adds one more approximation step to MEER, making the iterative search bi-directional so that it approaches the optimal value more closely. Deep-MEER leverages reinforcement learning and rich historical data to achieve thrashing-avoiding estimation without relying on termination conditions. Based on their differing input requirements and advantages, we further propose a scheme that adaptively applies MEER+ and Deep-MEER across the cluster life cycle. We have evaluated both strategies. Our experimental results show that MEER+ and Deep-MEER yield up to 88% and 20% higher accuracy, respectively. Moreover, Deep-MEER guarantees stable performance for applications across recurring executions.
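To illustrate the general idea of a bi-directional iterative search for a minimal safe memory reservation, the following is a minimal sketch, not the authors' MEER+ algorithm. It assumes a hypothetical predicate `runs_without_thrashing(mem_mb)` that reports whether a trial run with the given reservation completed without memory thrashing; the bounds and tolerance are illustrative.

```python
def estimate_reservation(runs_without_thrashing, lo_mb, hi_mb, tol_mb=64):
    """Bisect between a lower bound known to thrash (lo_mb) and an
    upper bound known to be safe (hi_mb), narrowing from both
    directions until the interval is within tol_mb megabytes."""
    while hi_mb - lo_mb > tol_mb:
        mid = (lo_mb + hi_mb) // 2
        if runs_without_thrashing(mid):
            hi_mb = mid   # mid is safe: a smaller reservation may suffice
        else:
            lo_mb = mid   # mid thrashes: the answer lies above it
    return hi_mb          # smallest reservation observed to be safe

# Example with a synthetic workload that needs at least 900 MB:
best = estimate_reservation(lambda m: m >= 900, lo_mb=0, hi_mb=2048, tol_mb=1)
```

Unlike a monotone one-directional search, this interval shrinks from both ends, which is the property the abstract attributes to MEER+'s extra approximation step.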
Acknowledgments
This work is supported by the National Key Research and Development Program of China (2019YFB1804502), the Key-Area Research and Development Program of Guangdong Province (Grant No. 2019B010107001), and the National Natural Science Foundation of China (Grant Nos. 61832020 and 61702569).
© 2021 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
Lin, J., Liu, F., Cai, Z., Huang, Z., Li, W., Xiao, N. (2021). Adaptive Online Estimation of Thrashing-Avoiding Memory Reservations for Long-Lived Containers. In: Gao, H., Wang, X., Iqbal, M., Yin, Y., Yin, J., Gu, N. (eds) Collaborative Computing: Networking, Applications and Worksharing. CollaborateCom 2020. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 349. Springer, Cham. https://doi.org/10.1007/978-3-030-67537-0_37
Print ISBN: 978-3-030-67536-3
Online ISBN: 978-3-030-67537-0