Adaptive Online Estimation of Thrashing-Avoiding Memory Reservations for Long-Lived Containers

  • Conference paper

Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom 2020)

Abstract

Data-intensive computing systems in cloud datacenters create long-lived containers and allocate memory resources to them to execute long-running applications. It is challenging to estimate exactly how much memory should be reserved for a container so as to achieve both smooth application execution and high resource utilization. Current state-of-the-art work has two limitations. First, its prediction accuracy is restricted by the monotonicity of its iterative search. Second, application performance fluctuates due to its termination conditions. In this paper, we propose two improved strategies based on MEER, called MEER+ and Deep-MEER, which are designed to assist memory allocation on resource managers such as YARN. MEER+ adds one more approximation step to MEER, making the iterative search bi-directional so that it approaches the optimal value more closely. Deep-MEER leverages reinforcement learning and rich execution data to achieve thrashing-avoiding estimation without involving termination conditions. Given their different input requirements and advantages, we also propose a scheme that adaptively adopts MEER+ and Deep-MEER across the cluster life cycle. We have evaluated MEER+ and Deep-MEER: our experimental results show that they yield up to 88% and 20% higher accuracy, respectively. Moreover, Deep-MEER guarantees stable performance for applications during recurring executions.
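
To make the bi-directional search style concrete, below is a minimal Python sketch of such an estimation loop. It is an illustration only, not the authors' MEER+ algorithm: the run_application probe, the 5% slowdown threshold, and the 64 MB search tolerance are all assumptions introduced here.

    # A minimal sketch (assumptions, not the authors' MEER+ implementation) of
    # a bi-directional search for the smallest thrashing-avoiding reservation.

    def run_application(reservation_mb: int) -> float:
        """Hypothetical probe: execute one recurring run of the long-lived
        container under `reservation_mb` of memory and return its runtime
        in seconds. In practice this would come from the resource manager
        (e.g. YARN) and its monitoring pipeline."""
        raise NotImplementedError

    def estimate_reservation(upper_mb: int,
                             lower_mb: int = 0,
                             tol_mb: int = 64,
                             slowdown_limit: float = 1.05) -> int:
        """Bisect between a known-safe upper bound and a lower bound that
        may thrash. Unlike a monotone descent, the search moves in both
        directions: down when a probe runs cleanly, back up when it
        thrashes, so an overshoot is corrected rather than ending the
        search."""
        baseline = run_application(upper_mb)      # runtime with ample memory
        lo, hi = lower_mb, upper_mb               # lo may thrash, hi is safe
        while hi - lo > tol_mb:
            mid = (lo + hi) // 2
            runtime = run_application(mid)
            if runtime <= slowdown_limit * baseline:
                hi = mid                          # mid is thrash-free: go lower
            else:
                lo = mid                          # mid thrashes: back off upward
        return hi                                 # smallest reservation seen safe

Deep-MEER, as the abstract describes, replaces this kind of probe-driven loop with a reinforcement-learning policy trained on rich data from recurring executions, which is how it avoids the termination conditions that cause performance to fluctuate.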

Acknowledgments

This work is supported by the National Key Research and Development Program of China (2019YFB1804502), the Key-Area Research and Development Program of Guangdong Province (Grant No. 2019B010107001), and the National Natural Science Foundation of China (Grant Nos. 61832020 and 61702569).

Author information

Corresponding author

Correspondence to Fang Liu.

Copyright information

© 2021 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Cite this paper

Lin, J., Liu, F., Cai, Z., Huang, Z., Li, W., Xiao, N. (2021). Adaptive Online Estimation of Thrashing-Avoiding Memory Reservations for Long-Lived Containers. In: Gao, H., Wang, X., Iqbal, M., Yin, Y., Yin, J., Gu, N. (eds) Collaborative Computing: Networking, Applications and Worksharing. CollaborateCom 2020. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 349. Springer, Cham. https://doi.org/10.1007/978-3-030-67537-0_37

  • DOI: https://doi.org/10.1007/978-3-030-67537-0_37

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-67536-3

  • Online ISBN: 978-3-030-67537-0

  • eBook Packages: Computer Science, Computer Science (R0)
