Abstract
Automated systems management solutions aim to reduce the pressure on administrators of complex, large-scale, distributed systems by automating many common management tasks. However, automation introduces a level of abstraction that can act as a barrier between the administrator and the elements being controlled. This barrier can impede the transition to the new management paradigms required by the growth of off-premise resources and hybrid cloud systems. The resulting loss of control over the managed environment can contribute to a loss of trust in automated systems management solutions and limit their broader use. This paper proposes a novel approach in which the administrator controls the automation level on a per-task basis. Administrators define a management task as they would perform it directly and allow the solution to identify the triggers that cause the task to be enacted. The solution also allows administrators to define relevant task output that can be analyzed for fault states, enabling error recovery without manual intervention. This approach reduces management effort for the administrator and the incidence of errors, while retaining controllability and keeping automation costs low.
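The per-task control described above can be illustrated with a minimal Python sketch. It is not the implementation evaluated in the paper: the names `AutomationLevel`, `ManagedTask`, and `handle`, the three automation levels, and the returned status strings are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class AutomationLevel(Enum):
    """Hypothetical levels; the paper's own scale may differ."""
    MANUAL = 1      # administrator performs the task directly
    SUGGEST = 2     # trigger is detected, administrator must approve
    AUTOMATIC = 3   # task is enacted as soon as the trigger fires


@dataclass
class ManagedTask:
    name: str
    action: Callable[[], str]        # the task as the administrator would run it
    trigger: Callable[[dict], bool]  # condition on observed system state
    is_fault: Callable[[str], bool]  # inspects task output for a fault state
    recovery: Callable[[], str]      # recovery step run when a fault is seen
    level: AutomationLevel = AutomationLevel.MANUAL
    log: List[str] = field(default_factory=list)

    def handle(self, state: dict, approved: bool = False) -> str:
        """Decide what to do for one observation of the system state."""
        if not self.trigger(state):
            return "idle"
        if self.level is AutomationLevel.MANUAL:
            return "notify"  # leave the task to the administrator
        if self.level is AutomationLevel.SUGGEST and not approved:
            return "awaiting-approval"
        output = self.action()
        self.log.append(output)
        if self.is_fault(output):
            # Fault detected in the task output: recover without manual steps.
            self.log.append(self.recovery())
            return "recovered"
        return "done"


# Illustrative task whose action always reports an error.
task = ManagedTask(
    name="restart-web",
    action=lambda: "ERROR: port in use",
    trigger=lambda s: s.get("web_up") is False,
    is_fault=lambda out: out.startswith("ERROR"),
    recovery=lambda: "freed port and restarted",
    level=AutomationLevel.SUGGEST,
)
```

In this sketch, raising `level` from `MANUAL` to `AUTOMATIC` for a single task changes only how that task is enacted, so the administrator can extend automation one task at a time.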







Acknowledgments
This work was carried out with the support of the GEYSERS (FP7-ICT-248657) project funded by the European Commission through the 7th ICT Framework Program. Neither this paper nor any part of its content has been published or accepted for publication elsewhere, nor has it been submitted to any other journal for review.
Cite this article
McLarnon, B., Robinson, P., Milligan, P. et al. An Iterative Approach to Trustable Systems Management Automation and Fault Handling. J Netw Syst Manage 22, 366–395 (2014). https://doi.org/10.1007/s10922-013-9295-z
DOI: https://doi.org/10.1007/s10922-013-9295-z