A methodology to assess the availability of next-generation data centers

The Journal of Supercomputing

Abstract

Cloud data center providers benefit from software-defined infrastructure because it promotes flexibility, automation, and scalability. This new paradigm helps providers address the management challenges of large-scale infrastructure and guarantee service level agreements with established availability levels. Assessing the availability of a data center remains a complex task, as it requires gathering information about a complex infrastructure and generating accurate models to estimate its availability. This paper addresses this gap by proposing a methodology that automatically acquires the data center hardware configuration and assesses its availability through models. The proposed methodology leverages the emerging standardized Redfish API and relevant modeling frameworks. Using this approach, we analyzed the availability benefits of migrating from a conventional data center infrastructure (named Performance Optimized Data center (POD), with redundant servers) to a next-generation virtual Performance Optimized Data center (named virtual POD (vPOD), composed of a pool of disaggregated hardware resources). Results show that vPOD improves availability compared to conventional data center configurations.
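
To illustrate why redundancy matters in this kind of assessment, the following is a minimal sketch of a generic reliability-block-diagram calculation. It is not the paper's model: the function names and the MTTF/MTTR figures are hypothetical, chosen only for illustration, and components are assumed to fail independently.

```python
from math import prod


def steady_state_availability(mttf_hours, mttr_hours):
    """Steady-state availability of one component: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def series(availabilities):
    """All blocks must be up, so availabilities multiply."""
    return prod(availabilities)


def parallel(availabilities):
    """At least one block must be up: 1 minus the chance that all are down."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Hypothetical figures for illustration only (not taken from the paper).
server = steady_state_availability(mttf_hours=1000.0, mttr_hours=10.0)
single = series([server])                # one server, no redundancy
redundant = parallel([server, server])   # two servers, either one suffices
```

Under these assumptions `redundant` is strictly greater than `single`, which is the basic effect any comparison between redundant and non-redundant configurations has to quantify.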

Notes

  1. For more details about the Mercury user manual and the Mercury scripting language, see [40, 41], respectively.

  2. https://github.com/DMTF/Redfish-Interface-Emulator.

  3. The system is considered highly available if it achieves five 9’s of availability (99.999%), meaning that its downtime is only about 5.26 min per year.
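
The downtime figure in note 3 follows from simple arithmetic. The sketch below reproduces it, assuming a 365-day year; the function names are hypothetical and serve only this illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def nines(n):
    """Availability with n nines, e.g. nines(5) is 0.99999."""
    return 1.0 - 10.0 ** (-n)


def annual_downtime_minutes(availability):
    """Expected downtime per year, in minutes, for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


five_nines_downtime = annual_downtime_minutes(nines(5))
```

With five nines, the unavailable fraction is 10^-5 of 525,600 minutes, which gives the roughly 5.26 min per year quoted in the note.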

References

  1. Trivedi KS, Bobbio A (2017) Reliability and availability engineering: modeling, analysis, and applications. Cambridge University Press, Cambridge

  2. British air data center outage feeds outrage at airline cost cuts (2017). http://www.datacenterknowledge.com. Accessed Nov 2018

  3. Al-Yatama A, Ahmad I, Al-Dabbous N (2017) Memory allocation algorithm for cloud services. J Supercomput 73(11):5006–5033

  4. Fard SYZ, Ahmadi MR, Adabi S (2017) A dynamic VM consolidation technique for QOS and energy consumption in cloud environment. J Supercomput 73(10):4347–4368

  5. Han S, Egi N, Panda A, Ratnasamy S, Shi G, Shenker S (2013) Network support for resource disaggregation in next-generation datacenters. In: Proceedings of the Twelfth ACM Workshop on Hot Topics in Networks, p 10. ACM

  6. Li CS, Franke H, Parris C, Abali B, Kesavan M, Chang V (2017) Composable architecture for rack scale big data computing. Future Gener Comput Syst 67:180–193

  7. Fareghzadeh N, Seyyedi MA, Mohsenzadeh M (2019) Toward holistic performance management in clouds: taxonomy, challenges and opportunities. J Supercomput 75(1):272–313

  8. Chen H, Zhu J, Zhang Z, Ma M, Shen X (2017) Real-time workflows oriented online scheduling in uncertain cloud environment. J Supercomput 73(11):4906–4922

  9. Li C, Zhu L, Liu Y, Luo Y (2017) Resource scheduling approach for multimedia cloud content management. J Supercomput 73(12):5150–5172

  10. Addabbo T, Fort A, Mugnaini M, Vignoli V, Simoni E, Mancini M (2016) Availability and reliability modeling of multicore controlled ups for datacenter applications. Reliab Eng Syst Saf 149:56–62. https://doi.org/10.1016/j.ress.2015.12.010

  11. Alissa HA, Nemati K, Sammakia BG, Seymour MJ, Tipton R, Mendo D, Demetriou DW, Schneebeli K (2016) Chip to chiller experimental cooling failure analysis of data centers: the interaction between it and facility. IEEE Trans Compon Packag Manuf Technol 6(9):1361–1378. https://doi.org/10.1109/TCPMT.2016.2599025

  12. Callou G, Maciel P, Tutsch D, Araújo J (2012) Models for dependability and sustainability analysis of data center cooling architectures. In: IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), pp 1–6. https://doi.org/10.1109/DSNW.2012.6264697

  13. Liu Z, Chen Y, Bash C, Wierman A, Gmach D, Wang Z, Marwah M, Hyser C (2012) Renewable and cooling aware workload management for sustainable data centers. In: Proceedings of the 12th ACM SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’12, pp 175–186. ACM, New York, NY, USA. https://doi.org/10.1145/2254756.2254779

  14. Callou G, Maciel P, Tutsch D, Ferreira J, Araújo J, Souza R (2013) Estimating sustainability impact of high dependable data centers: a comparative study between brazilian and US energy mixes. Computing 95(12):1137–1170. https://doi.org/10.1007/s00607-013-0328-y

  15. Gomes D, Endo P, Gonçalves G, Rosendo D, Santos G, Kelner J, Sadok D, Mahloo M (2017) Evaluating the cooling subsystem availability on a cloud data center. In: IEEE Symposium on Computers and Communications. IEEE

  16. Santos G, Endo P, Gonçalves G, Rosendo D, Gomes D, Kelner J, Sadok D, Mahloo M (2017) Analyzing the it subsystem failure impact on availability of cloud services. In: IEEE Symposium on Computers and Communications. IEEE

  17. Rosendo D, Santos G, Gomes D, Moreira A, Gonçalves G, Endo P, Kelner J, Sadok D, Mahloo M (2017) How to improve cloud services availability? Investigating the impact of power and it subsystems failures. In: HICSS Hawaii International Conference on System Sciences. HICSS

  18. Redfish composability white paper (2017). https://www.dmtf.org/sites/default/files/standards/documents/DSP2050_1.0.0.pdf. Accessed Apr 2018

  19. Cheng J, Grinnemo KJ (2017) Telco distributed DC with transport protocol enhancement for 5G mobile networks: a survey. Karlstads universitet

  20. Intel rack scale design architecture specification (2018) Software v2.3.3

  21. Intel rack scale design architecture (2019). https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/rack-scale-design-architecture-white-paper.pdf. Accessed Mar 2019

  22. Megarac solutions for intel rack scale design standards (2019). https://ami.com/ami_downloads/MegaRAC_Solutions_for_Intel_Rack_Scale_Design_Data_Sheet.pdf. Accessed Mar 2019

  23. Supermicro rack scale design (rsd) solution overview (2019). https://www.supermicro.com/solutions/SRSD.cfm. Accessed Mar 2019

  24. Redfish scalable platforms management api specification (2018) DMTF Redfish DSP0266

  25. Fazlollahtabar H, Akhavan Niaki ST (2017) Integration of fault tree analysis, reliability block diagram and hazard decision tree for industrial robot reliability evaluation. Ind Robot Int J 44(6):754–764

  26. Maciel P, Trivedi K, Matias R, Kim D (2010) Dependability modeling. In: Performance and dependability in service computing: Concepts, Techniques and Research Directions. IGI Global, Hershey, Pennsylvania, USA, 13

  27. Araujo J, Maciel P, Torquato M, Callou G, Andrade E (2014) Availability evaluation of digital library cloud services. In: 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp 666–671. IEEE

  28. Kitchin JF (1988) Practical Markov modeling for reliability analysis. In: 1988 Proceedings of the Annual Reliability and Maintainability Symposium, pp 290–296. IEEE

  29. Malhotra M, Reibman A (1993) Selecting and implementing phase approximations for semi-markov models. Stoch Models 9(4):473–506

  30. Høyland A, Rausand M (2009) System reliability theory: models and statistical methods, vol 420. Wiley, New York

  31. Vu-Bac N, Lahmer T, Zhuang X, Nguyen-Thoi T, Rabczuk T (2016) A software framework for probabilistic sensitivity analysis for computationally expensive models. Adv Eng Softw 100:19–31

  32. Pianosi F, Beven K, Freer J, Hall JW, Rougier J, Stephenson DB, Wagener T (2016) Sensitivity analysis of environmental models: a systematic review with practical workflow. Environ Model Softw 79:214–232

  33. Hamby D (1994) A review of techniques for parameter sensitivity analysis of environmental models. Environ Monit Assess 32(2):135–154

  34. Andrade E, Nogueira B, Matos R, Callou G, Maciel P (2017) Availability modeling and analysis of a disaster-recovery-as-a-service solution. Computing 99:1–26

  35. Kumari P, Saleem F, Sill A, Chen Y (2017) Validation of redfish: the scalable platform management standard. In: Companion Proceedings of the 10th International Conference on Utility and Cloud Computing, pp 113–117. ACM

  36. Redfish resource and schema guide (2017) DSP2046 DMTF Redfish

  37. Cassandras CG, Lafortune S (2009) Introduction to discrete event systems. Springer, Berlin

  38. Verma AK, Ajit S, Karanki DR (2010) Reliability and safety engineering, vol 43. Springer, Berlin

  39. Maciel P, Matos R, Silva B, Figueiredo J, Oliveira D, Fé I, Maciel R, Dantas J (2017) Mercury: Performance and dependability evaluation of systems with exponential, expolynomial, and general distributions. In: 2017 IEEE 22nd Pacific Rim International Symposium on Dependable Computing (PRDC), pp 50–57. IEEE

  40. Mercury tool manual v4.7.0 (2019). http://www.modcs.org/wp-content/uploads/tools/Mercury_Tool_Manual_v4.7.0.pdf. Accessed Mar 2019

  41. Oliveira D (2019) The mercury scripting language cookbook. Available at: http://www.modcs.org/?page_id=1703. Accessed Apr 2019

  42. Smith WE, Trivedi KS, Tomek LA, Ackaret J (2008) Availability analysis of blade server systems. IBM Syst J 47(4):621–640

  43. Brosch F, Koziolek H, Buhnova B, Reussner R (2010) Parameterized reliability prediction for component-based software architectures. In: International Conference on the Quality of Software Architectures, pp 36–51. Springer

  44. Gomes D, Santos GL, Rosendo D, Gonçalves G, Moreira A, Kelner J, Sadok D, Endo PT (2019) Measuring the impact of data center failures on a cloud-based emergency medical call system. Concurr Comput Pract Exper. https://doi.org/10.1002/cpe.5156

  45. Cérin C, Coti C, Delort P, Diaz F, Gagnaire M, Gaumer Q, Guillaume N, Lous J, Lubiarz S, Raffaelli J et al (2013) Downtime statistics of current cloud solutions. International Working Group on Cloud Computing Resiliency. Technical Report

  46. Endo PT, Santos GL, Rosendo D, Gomes DM, Moreira A, Kelner J, Sadok D, Gonçalves GE, Mahloo M (2017) Minimizing and managing cloud failures. Computer 50(11):86–90

  47. Jammal M, Kanso A, Heidari P, Shami A (2017) Evaluating high availability-aware deployments using stochastic petri net model and cloud scoring selection tool. IEEE Trans Serv Comput PP:1

Acknowledgements

This work was supported by the Research, Development and Innovation Center, Ericsson Telecomunicações S.A., Brazil. The authors would like to thank Carolina Cani for her support with our images.

Author information

Corresponding author

Correspondence to Patricia Takako Endo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Rosendo, D., Gomes, D., Santos, G.L. et al. A methodology to assess the availability of next-generation data centers. J Supercomput 75, 6361–6385 (2019). https://doi.org/10.1007/s11227-019-02852-3

  • DOI: https://doi.org/10.1007/s11227-019-02852-3
