ABSTRACT
Streamlined configuration management plays a significant role in modern, complex distributed systems. Through mechanisms that promote consistency, repeatability, and transparency, configuration management systems (CMSes) address this complexity and aim to increase the efficiency of administrative procedures, including deployment and failure recovery. Because minimizing disruptions in these systems is important, we design an architecture that increases the persistence and reliability of infrastructure management. We present our architecture in the context of hybrid cluster-cloud environments and describe our highly available implementation, which builds upon the open source CMS Chef and infrastructure-as-a-service cloud resources from Amazon Web Services. We demonstrate how we enabled a smooth transition from the pre-existing single-server configuration to the proposed highly available management system. We summarize our experience with managing a 20-node Linux cluster using this implementation. Our analysis of the utilization and cost of the necessary cloud resources indicates that the designed system is a low-cost alternative to acquiring additional physical hardware for hardening cluster management. We also highlight the prototype's security and manageability features that are suitable for larger, production-ready deployments.
Index Terms
- Architecting a Persistent and Reliable Configuration Management System