Autonomic fault mitigation in embedded systems

https://doi.org/10.1016/j.engappai.2004.08.031

Abstract

Autonomy, particularly from a maintenance and fault-management perspective, is an increasingly desirable feature in embedded (and non-embedded) computer systems. The driving factors are several, including the increasing pervasiveness of computer systems, the cost of failures, which could be catastrophic in a wide variety of critical systems, and the increasing cost of and strain on the resources needed to maintain these systems. A trigger system employed in real-time filtering of particle-collision data is a particularly challenging example of a class of large-scale real-time embedded systems that demand a high degree of fault resilience, due to the large cost of operating the facilities and the potential for loss of irreplaceable data. Traditional redundancy-based approaches are not available because the budget allows only a limited fault-tolerance overhead above the base system cost. This paper presents an approach based on model integrated computing that provides a set of tools for the system developer to specify, simulate, and synthesize autonomous fault-mitigative behaviors. A hierarchical, role-based organization of fault managers cleanly separates the data-processing interactions in the system from the fault-mitigative control interactions. The fault-mitigative behaviors, analogous to those of autonomous biological systems, are characterized as (1) reflex actions: highly autonomous, localized, and uncoordinated responses emanating from a single fault manager at any level of the hierarchy, and (2) healing actions: highly coordinated behavior implemented as a sequence of interactions between multiple fault managers. The strength of the approach lies in the specification of these behaviors as coordinated, interacting, hierarchical, concurrent finite-state machines, which makes them formally analyzable.

Introduction

The increasing pervasiveness, scale, and complexity of embedded (and non-embedded) computer systems, and the resultant increased cost of tuning and maintaining these systems, have drawn the attention of industrial and academic researchers towards novel approaches to the self-management of computer systems. Autonomic computing is one such initiative that is being aggressively pursued by several industry leaders, including IBM, HP, Sun, and Microsoft. Autonomic computing has been defined as a self-managing computing model named after, and patterned on, the human body's autonomic nervous system. According to this definition, an autonomic computing system controls the functioning of computer applications and systems with minimal human intervention, in the same way that the autonomic nervous system regulates body systems without conscious input from the individual.

The research presented in this paper is motivated by the fault-tolerance requirements of a class of large-scale real-time embedded (LSRTE) systems, such as those employed in high-energy physics online trigger applications. A defining characteristic of this class of systems is their sheer scale (thousands of processors), which is comparable to grid computing systems. What sets these systems apart from grid systems, however, is their tight timing constraints (hard real-time, in milliseconds to microseconds), the physical co-location of the processing elements (motivated by the timing requirements), and their embedding within a large physical system (the trigger is embedded in a particle accelerator/collider system). The sheer scale of these (and similar grid) systems makes traditional redundancy-based approaches to fault tolerance infeasible, due to budgetary, power, and size constraints. Employing a triple-modular-redundancy solution would increase the cost of the system to significantly more than three times that of a non-tolerant system. Another crucial observation in this class of systems is that degraded performance is a viable mode of operation, preferable to a complete and catastrophic failure of the system in response to component failures.

Autonomic computing offers some interesting opportunities in addressing the fault-tolerance requirements of this class of systems. Consider, for example, the characteristics of an autonomic computing system as defined in IBM (2004):

  1. it must maintain comprehensive and specific knowledge about all its components;
  2. it must have the ability to self-configure to suit varying and possibly unpredictable conditions;
  3. it must constantly monitor itself for optimal functioning; it must be self-healing and able to find alternate ways to function when it encounters problems;
  4. it must be able to detect threats and protect itself from them; it must be able to adapt to environmental conditions;
  5. it must be based on open standards rather than proprietary technologies; and
  6. it must anticipate demand while remaining transparent to the user.

Clearly, some of the capabilities listed above are desirable in a system that must withstand unpredictable failures. A solution that offers fault tolerance by actively and adaptively reconfiguring the system (referred to as fault mitigation), with significantly lower overhead than the redundancy factor of typical fault-tolerance solutions, is viable in the LSRTE class of systems. Unfortunately, the characteristics listed above are notional, and to date there are no off-the-shelf solutions that could be quickly customized and applied to the trigger system.

Moreover, while the benefits of an autonomic computing approach are significant, they come at the cost of increased complexity in developing an autonomic system. Developing a general-purpose infrastructure that supports all the capabilities listed above, in addition to meeting the real-time requirements of the systems of interest, would be a challenging undertaking. It would require significant effort in designing and programming autonomic responses at multiple levels of the software infrastructure, including the application, the middleware, and the operating system. Furthermore, the autonomic responses must be coordinated across these levels and across processor boundaries in what is essentially a large-scale distributed system.

Therefore, automated tools are required to assist system developers in managing the complexity of designing an autonomic response system. These tools should offer the following: (1) higher-level abstractions for designing adaptive behaviors, which are easier to manipulate and maintain; (2) the ability to analyze and simulate the autonomic responses to assess the system's ability to adapt to different failure scenarios; and (3) the ability to synthesize low-level programming artifacts from the higher-level abstractions.
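Requirements (2) and (3) imply that a single behavior specification can serve both as a simulation input and as a source for code synthesis. The sketch below illustrates this idea only; the spec format and function names are our own illustrative inventions, not the paper's actual tools:

```python
# Hypothetical, minimal state-machine specification for a mitigation
# behavior (the spec format and state/event names are illustrative).
MITIGATION_SPEC = {
    "initial": "OK",
    "transitions": [
        {"from": "OK", "event": "fault", "to": "MITIGATING"},
        {"from": "MITIGATING", "event": "fixed", "to": "OK"},
        {"from": "MITIGATING", "event": "timeout", "to": "FAILED"},
    ],
}


def simulate(spec, events):
    """Replay an event trace against the spec (requirement 2)."""
    state = spec["initial"]
    trace = [state]
    for event in events:
        for t in spec["transitions"]:
            if t["from"] == state and t["event"] == event:
                state = t["to"]
                break
        trace.append(state)
    return trace


def synthesize_c(spec):
    """Emit a low-level artifact, here a C switch skeleton (requirement 3)."""
    lines = ["switch (state) {"]
    for t in spec["transitions"]:
        lines.append(
            f"  case {t['from']}: if (event == {t['event'].upper()}) "
            f"state = {t['to']}; break;"
        )
    lines.append("}")
    return "\n".join(lines)


print(simulate(MITIGATION_SPEC, ["fault", "fixed"]))  # ['OK', 'MITIGATING', 'OK']
print(synthesize_c(MITIGATION_SPEC))
```

The point of the sketch is that both functions consume the same higher-level model, so the simulated behavior and the synthesized artifact cannot drift apart.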

This paper describes one such tool suite, based on the principles of model integrated computing (MIC) (Sztipanovits, 1998; Nordstrom, 1999; Bapty et al., 2000). The key elements of the approach are a graphical modeling environment (GME), demonstrated in Ledeczi et al. (2000), that instantiates a domain-specific modeling language (DSML), and a suite of translators that transform domain models into simulation and low-level programming artifacts.

The rest of the paper is organized as follows: Section 2 describes the core concepts of the autonomic fault-mitigation approach. Section 3 provides details of the modeling environment, the simulation of the domain models, and the automatic synthesis of low-level programming artifacts from design models. Section 4 presents a case study applying the tools to a sub-scale prototype of the BTeV trigger system. An overview of related research activities is presented in Section 5, and finally Section 6 concludes the paper with an evaluation of this research and proposals for future work.
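The reflex and healing behaviors introduced in the abstract can be sketched in code. The sketch below is illustrative only, under our own assumptions (the class and method names are invented, not from the paper): a fault manager first attempts a localized reflex action, and when a local fix is not possible it escalates to its parent, which coordinates a healing sequence across its children.

```python
from enum import Enum, auto


class State(Enum):
    """States of a fault manager's mitigation state machine."""
    OK = auto()
    FAULTY = auto()
    HEALING = auto()


class FaultManager:
    """One node in a hierarchy of fault managers (a hypothetical sketch).

    A reflex action is handled locally and without coordination; a fault
    the node cannot mitigate alone is escalated to its parent, which
    coordinates a healing action across the affected children.
    """

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.state = State.OK
        if parent is not None:
            parent.children.append(self)

    def on_fault(self, fault, local_fix_possible):
        self.state = State.FAULTY
        if local_fix_possible:
            # Reflex: localized, uncoordinated response at this level.
            self.state = State.OK
            return f"{self.name}: reflex mitigated {fault}"
        if self.parent is not None:
            # Escalate: the parent coordinates a healing sequence.
            return self.parent.heal(self, fault)
        return f"{self.name}: unmitigated {fault}"

    def heal(self, child, fault):
        # Healing: a coordinated sequence of interactions across managers,
        # e.g. reconfiguring or restarting each affected child.
        self.state = State.HEALING
        for c in self.children:
            c.state = State.OK
        self.state = State.OK
        return f"{self.name}: healed {fault} reported by {child.name}"


regional = FaultManager("regional")
worker = FaultManager("worker-7", parent=regional)

print(worker.on_fault("buffer-overflow", local_fix_possible=True))
print(worker.on_fault("node-crash", local_fix_possible=False))
```

In the paper's approach these behaviors are specified as hierarchical concurrent finite-state machines rather than ad hoc methods, which is what makes them amenable to formal analysis.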

Section snippets

Autonomic fault mitigation

The overarching goals of fault mitigation are threefold:

  1. maintain the maximal application functionality for any set of component failures,
  2. recover from failures as completely and rapidly as possible, and
  3. minimize the system cost.

These goals are contradictory in nature. Maintaining the maximum application functionality for a set of component failures requires redundant resources. However, increasing redundancy increases the cost of the system. Our autonomic fault-mitigation approach carefully

System architecture

The core concepts listed below are realized at multiple levels. A GME allows the system developer to specify the hierarchical organization of fault managers, and also allows specification of the fault-mitigation behavior of the fault managers at each level of the hierarchy. A simulation infrastructure facilitates simulation and analysis of this network of interacting fault managers. A runtime infrastructure allows instantiation and deployment of fault managers as

Case study

The tools were tested by implementing a subscale prototype of the BTeV trigger system. The prototype was demonstrated at the Super Computing 2003 (SC2003) conference. The focus of this prototype was to demonstrate the ability of autonomic fault mitigation to handle a set of pre-defined errors that physicists typically experience in similar systems. The prototype used the experimental physics and industrial control system (EPICS) to provide an interface for monitoring and controlling

Related research

Our research could be summarized as an application of autonomic computing techniques to provision fault tolerance in large-scale real-time embedded systems using a model-based design approach. Accordingly, a brief summary of leading research initiatives in these areas is presented.

Conclusions

This paper presents tools and prototypes that apply autonomic computing concepts to provision fault tolerance in a class of large-scale real-time embedded systems. A suite of tools has been developed using model-based design principles to design the autonomic computing solution. The modeling language allows an integrated specification of the domain architecture and the autonomic fault-mitigation behaviors. The choice of a state-machine-based formalism for specifying behaviors was motivated by

Acknowledgments

This work was supported by NSF under the ITR grant ACI-0121658. The authors also acknowledge the contribution of other RTES collaboration team members at Fermilab, UIUC, and Pittsburgh and Syracuse Universities.


References (22)

  • Harel, D., 1987. Statecharts: a visual formalism for complex systems. Science of Computer Programming.
  • Avizienis, A., 1997. Toward systematic design of fault-tolerant systems. IEEE Computer.
  • Avizienis, A., Avizienis, R., 2001. An immune system paradigm for the design of fault-tolerant systems. Workshop for...
  • Bapty, T., et al., 2000. Model-integrated tools for the design of dynamically reconfigurable systems. VLSI System.
  • Butler, J., et al., 2001. Fault tolerant issues in the BTeV trigger. The Future of Particle Physics.
  • Campos, S., et al. Real-time symbolic model checking for discrete time models.
  • Fussel, D., et al., 2000. Hierarchical motor diagnosis utilizing structural knowledge and a self-learning neuro-fuzzy scheme. IEEE Transactions on Industrial Electronics.
  • Holzmann, G.J., 1997. The model checker SPIN. IEEE Transactions on Software Engineering.
  • Holzmann, G.J., 2003. The SPIN Model Checker: Primer and Reference Manual.
  • IBM Autonomic Research Webpage...
  • Kalbarczyk, Z.T., et al., 1999. Chameleon: a software infrastructure for adaptive fault tolerance. IEEE Transactions on Parallel and Distributed Systems.

Dr. Sandeep Neema is a Research Assistant Professor of Electrical Engineering and Computer Science at the Institute for Software Integrated Systems, Vanderbilt University. His research interests include dynamic adaptation for QoS assurance in distributed real-time embedded systems, model-based design of embedded systems, aspect-oriented program composition techniques, design space exploration and constraint-based synthesis of embedded systems, and fault tolerance in large-scale computing clusters. He received his Ph.D. from Vanderbilt University in 2001.

Dr. Ted Bapty is a Senior Research Scientist and Research Assistant Professor at the Institute for Software Integrated Systems, Vanderbilt University. His research interests include model-based systems, hardware/software co-design, fault-tolerant systems, program composition, embedded high-performance computing, aspect-oriented program composition, and real-time systems. He received his Ph.D. from Vanderbilt University in 1995.

Shweta Shetty is a Staff Engineer at the Institute for Software Integrated Systems, Vanderbilt University. Her research interests include model-based design for embedded systems, model-driven architecture, fault-tolerant systems, and distributed real-time systems. She received her Master's from Vanderbilt University in 2004.

Steven Nordstrom is a graduate student at the Institute for Software Integrated Systems, Vanderbilt University. His research interests include hardware/software co-design, fault-tolerant systems, and real-time embedded systems. He received his Master's from Vanderbilt University in 2003.
