Abstract
Data-intensive computing systems in cloud datacenters create long-lived containers and allocate memory resources to them for executing long-running applications. Accurately estimating how much memory should be reserved for a container, so that applications run smoothly while resource utilization stays high, remains a challenge. Current state-of-the-art work has two limitations. First, prediction accuracy is restricted by the monotonicity of the iterative search. Second, application performance fluctuates due to the termination conditions. In this paper, we propose two improved strategies based on MEER, called MEER+ and Deep-MEER, which are designed to assist memory allocation atop resource managers such as YARN. MEER+ adds one more approximation step to MEER, making the iterative search bi-directional so that it approaches the optimal value more closely. Deep-MEER leverages reinforcement learning and rich historical data to achieve thrashing-avoiding estimation without relying on termination conditions. Based on their differing input requirements and advantages, we further propose a scheme that adaptively applies MEER+ and Deep-MEER across the cluster life cycle. We have evaluated both strategies. Our experimental results show that MEER+ and Deep-MEER yield up to 88% and 20% higher accuracy, respectively. Moreover, Deep-MEER guarantees stable performance for applications across recurring executions.
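To illustrate the general idea of a bi-directional iterative search for a minimal safe memory reservation, the following is a minimal sketch, not the authors' MEER+ algorithm. It assumes a hypothetical predicate `runs_without_thrashing(mem_mb)` that reports whether a trial run with the given reservation completed without memory thrashing; the bounds and tolerance are illustrative.

```python
def estimate_reservation(runs_without_thrashing, lo_mb, hi_mb, tol_mb=64):
    """Bisect between a lower bound known to thrash (lo_mb) and an
    upper bound known to be safe (hi_mb), narrowing from both
    directions until the interval is within tol_mb megabytes."""
    while hi_mb - lo_mb > tol_mb:
        mid = (lo_mb + hi_mb) // 2
        if runs_without_thrashing(mid):
            hi_mb = mid   # mid is safe: a smaller reservation may suffice
        else:
            lo_mb = mid   # mid thrashes: the answer lies above it
    return hi_mb          # smallest reservation observed to be safe

# Example with a synthetic workload that needs at least 900 MB:
best = estimate_reservation(lambda m: m >= 900, lo_mb=0, hi_mb=2048, tol_mb=1)
```

Unlike a monotone one-directional search, this interval shrinks from both ends, which is the property the abstract attributes to MEER+'s extra approximation step.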
Acknowledgments
This work is supported by the National Key Research and Development Program of China (2019YFB1804502), the Key-Area Research and Development Program of Guangdong Province (Grant No. 2019B010107001), and the National Natural Science Foundation of China (Grant Nos. 61832020 and 61702569).
© 2021 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
Lin, J., Liu, F., Cai, Z., Huang, Z., Li, W., Xiao, N. (2021). Adaptive Online Estimation of Thrashing-Avoiding Memory Reservations for Long-Lived Containers. In: Gao, H., Wang, X., Iqbal, M., Yin, Y., Yin, J., Gu, N. (eds) Collaborative Computing: Networking, Applications and Worksharing. CollaborateCom 2020. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 349. Springer, Cham. https://doi.org/10.1007/978-3-030-67537-0_37
Print ISBN: 978-3-030-67536-3
Online ISBN: 978-3-030-67537-0