Autonomic fault mitigation in embedded systems

https://doi.org/10.1016/j.engappai.2004.08.031

Abstract

Autonomy, particularly from a maintenance and fault-management perspective, is an increasingly desirable feature in embedded (and non-embedded) computer systems. The driving factors are several, including the increasing pervasiveness of computer systems, the cost of failures, which could be catastrophic in a wide variety of critical systems, and the increasing cost of and strain on the resources needed to maintain these systems. A trigger system employed in real-time filtering of particle-collision data is a particularly challenging example of a class of large-scale real-time embedded systems that demand a high degree of fault resilience, due to the large cost of operating the facilities and the potential for loss of irreplaceable data. Traditional redundancy-based approaches are not available because the budget allows only a limited fault-tolerance overhead above the base system cost. This paper presents an approach based on model integrated computing that provides a set of tools for the system developer to specify, simulate, and synthesize autonomous fault-mitigative behaviors. A hierarchical, role-based organization of fault managers cleanly separates the data-processing interactions in the system from the fault-mitigative control interactions. The fault-mitigative behaviors, analogous to those of autonomous biological systems, are characterized as (1) reflex actions: highly autonomous, localized, and uncoordinated responses emanating from a single fault manager at any level of the hierarchy, and (2) healing actions: highly coordinated behavior implemented as a sequence of interactions between multiple fault managers. The strength of the approach lies in the specification of these behaviors as coordinated, interacting, hierarchical, concurrent finite-state machines, which makes them formally analyzable.

Introduction

The increasing pervasiveness, scale, and complexity of embedded (and non-embedded) computer systems, and the resultant increased cost of tuning and maintaining these systems, have drawn the attention of industrial and academic researchers towards novel approaches to the self-management of computer systems. Autonomic computing is one such initiative that is being aggressively pursued by several industry leaders, including IBM, HP, Sun, and Microsoft. Autonomic computing has been defined as a self-managing computing model named after, and patterned on, the human body's autonomic nervous system. According to this definition, an autonomic computing system controls the functioning of computer applications and systems with minimal human intervention, in the same way that the autonomic nervous system regulates body systems without conscious input from the individual.

The research presented in this paper is motivated by the fault-tolerance requirements of a class of large-scale real-time embedded (LSRTE) systems, such as those employed in high-energy physics online trigger applications. A defining characteristic of this class of systems is their sheer scale (thousands of processors), which is comparable to grid computing systems. What sets these systems apart from grid systems, however, is their tight timing constraints (hard real-time, in milliseconds to microseconds), the physical co-location of the processing elements (motivated by the timing requirements), and their embedding within a large physical system (the trigger is embedded in a particle accelerator/collider system). The sheer scale of these (and similar grid) systems makes traditional redundancy-based approaches to fault tolerance infeasible, due to budgetary, power, and size constraints. Employing a triple-modular-redundancy solution would increase the cost of the system to significantly more than three times that of a non-tolerant system. Another crucial observation in this class of systems is that degraded performance is a viable mode of operation, preferable to a complete and catastrophic failure of the system in response to component failures.

Autonomic computing offers some interesting opportunities in addressing the fault-tolerance requirements of this class of systems. Consider, for example, the characteristics of an autonomic computing system as defined in IBM (2004):

  1. it must maintain comprehensive and specific knowledge about all its components;
  2. it must have the ability to self-configure to suit varying and possibly unpredictable conditions;
  3. it must constantly monitor itself for optimal functioning; it must be self-healing and able to find alternate ways to function when it encounters problems;
  4. it must be able to detect threats and protect itself from them; it must be able to adapt to environmental conditions;
  5. it must be based on open standards rather than proprietary technologies; and
  6. it must anticipate demand while remaining transparent to the user.

Clearly, some of the capabilities listed above are desirable in a system that must withstand unpredictable failures. A solution that offers fault tolerance by actively and adaptively reconfiguring the system (referred to as fault mitigation), with significantly lower overhead than the redundancy factor of typical fault-tolerance solutions, is viable in the LSRTE class of systems. Unfortunately, the characteristics listed above are notional, and to date there are no off-the-shelf solutions that could be quickly customized and applied to the trigger system.

Moreover, while the benefits of an autonomic computing approach are significant, they come at the cost of increased complexity in developing an autonomic system. Developing a general-purpose infrastructure that supports all the capabilities listed above, in addition to meeting the real-time requirements of the systems of interest, would be a challenging undertaking. It would require significant effort in designing and programming autonomic responses at multiple levels of the software infrastructure, including the application, the middleware, and the operating system. Furthermore, the autonomic responses must be coordinated across these levels and across processor boundaries in what is essentially a large-scale distributed system.

Therefore, automated tools are required to assist system developers in managing the complexity of designing an autonomic response system. These tools should offer the following: (1) higher-level abstractions for designing adaptive behaviors, which are easier to manipulate and maintain; (2) the ability to analyze and simulate the autonomic responses to assess the system's ability to adapt to different failure scenarios; and (3) the ability to synthesize low-level programming artifacts from the higher-level abstractions.
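Requirements (2) and (3) imply that a single behavior specification can serve both as a simulation input and as a source for code synthesis. The sketch below illustrates this idea only; the spec format and function names are our own illustrative inventions, not the paper's actual tools:

```python
# Hypothetical, minimal state-machine specification for a mitigation
# behavior (the spec format and state/event names are illustrative).
MITIGATION_SPEC = {
    "initial": "OK",
    "transitions": [
        {"from": "OK", "event": "fault", "to": "MITIGATING"},
        {"from": "MITIGATING", "event": "fixed", "to": "OK"},
        {"from": "MITIGATING", "event": "timeout", "to": "FAILED"},
    ],
}


def simulate(spec, events):
    """Replay an event trace against the spec (requirement 2)."""
    state = spec["initial"]
    trace = [state]
    for event in events:
        for t in spec["transitions"]:
            if t["from"] == state and t["event"] == event:
                state = t["to"]
                break
        trace.append(state)
    return trace


def synthesize_c(spec):
    """Emit a low-level artifact, here a C switch skeleton (requirement 3)."""
    lines = ["switch (state) {"]
    for t in spec["transitions"]:
        lines.append(
            f"  case {t['from']}: if (event == {t['event'].upper()}) "
            f"state = {t['to']}; break;"
        )
    lines.append("}")
    return "\n".join(lines)


print(simulate(MITIGATION_SPEC, ["fault", "fixed"]))  # ['OK', 'MITIGATING', 'OK']
print(synthesize_c(MITIGATION_SPEC))
```

The point of the sketch is that both functions consume the same higher-level model, so the simulated behavior and the synthesized artifact cannot drift apart.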

This paper describes one such tool suite, based on the principles of model integrated computing (MIC) (Sztipanovits, 1998; Nordstrom, 1999; Bapty et al., 2000). The key elements of the approach are a graphical modeling environment (GME), demonstrated in Ledeczi et al. (2000), that instantiates a domain-specific modeling language (DSML), and a suite of translators that transform domain models into simulation and low-level programming artifacts.

The rest of the paper is organized as follows: Section 2 describes the core concepts of the autonomic fault-mitigation approach. Section 3 provides details of the modeling environment, the simulation of the domain models, and the automatic synthesis of low-level programming artifacts from design models. Section 4 presents a case study applying the tools to a sub-scale prototype of the BTeV trigger system. An overview of related research activities is presented in Section 5, and finally Section 6 concludes the paper with an evaluation of this research and proposals for future work.
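The reflex and healing behaviors introduced in the abstract can be sketched in code. The sketch below is illustrative only, under our own assumptions (the class and method names are invented, not from the paper): a fault manager first attempts a localized reflex action, and when a local fix is not possible it escalates to its parent, which coordinates a healing sequence across its children.

```python
from enum import Enum, auto


class State(Enum):
    """States of a fault manager's mitigation state machine."""
    OK = auto()
    FAULTY = auto()
    HEALING = auto()


class FaultManager:
    """One node in a hierarchy of fault managers (a hypothetical sketch).

    A reflex action is handled locally and without coordination; a fault
    the node cannot mitigate alone is escalated to its parent, which
    coordinates a healing action across the affected children.
    """

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.state = State.OK
        if parent is not None:
            parent.children.append(self)

    def on_fault(self, fault, local_fix_possible):
        self.state = State.FAULTY
        if local_fix_possible:
            # Reflex: localized, uncoordinated response at this level.
            self.state = State.OK
            return f"{self.name}: reflex mitigated {fault}"
        if self.parent is not None:
            # Escalate: the parent coordinates a healing sequence.
            return self.parent.heal(self, fault)
        return f"{self.name}: unmitigated {fault}"

    def heal(self, child, fault):
        # Healing: a coordinated sequence of interactions across managers,
        # e.g. reconfiguring or restarting each affected child.
        self.state = State.HEALING
        for c in self.children:
            c.state = State.OK
        self.state = State.OK
        return f"{self.name}: healed {fault} reported by {child.name}"


regional = FaultManager("regional")
worker = FaultManager("worker-7", parent=regional)

print(worker.on_fault("buffer-overflow", local_fix_possible=True))
print(worker.on_fault("node-crash", local_fix_possible=False))
```

In the paper's approach these behaviors are specified as hierarchical concurrent finite-state machines rather than ad hoc methods, which is what makes them amenable to formal analysis.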

Section snippets

Autonomic fault mitigation

The overarching goals of fault mitigation are threefold:

  1. maintain the maximal application functionality for any set of component failures,
  2. recover from failures as completely and rapidly as possible, and
  3. minimize the system cost.

These goals are contradictory in nature. Maintaining the maximum application functionality for a set of component failures requires redundant resources. However, increasing redundancy increases the cost of the system. Our autonomic fault-mitigation approach carefully

System architecture

The core concepts listed below are realized at multiple levels. A GME allows the system developer to specify the hierarchical organization of fault managers, and also allows specification of the fault-mitigation behavior of the fault managers at each level of the hierarchy. A simulation infrastructure facilitates simulation and analysis of this network of interacting fault managers. A runtime infrastructure allows instantiation and deployment of fault managers as

Case study

The tools were tested by implementing a subscale prototype of the BTeV trigger system. The prototype was demonstrated at the Super Computing 2003 (SC2003) conference. The focus of this prototype was to demonstrate the ability of autonomic fault mitigation to handle a set of pre-defined errors that physicists typically experience in similar systems. The prototype used the experimental physics and industrial control system (EPICS) to provide an interface for monitoring and controlling

Related research

Our research could be summarized as an application of autonomic computing techniques to provision fault tolerance in large-scale real-time embedded systems using a model-based design approach. Accordingly, a brief summary of leading research initiatives in these areas is presented.

Conclusions

This paper presents tools and prototypes that apply autonomic computing concepts to provision fault tolerance in a class of large-scale real-time embedded systems. A suite of tools has been developed using model-based design principles to design the autonomic computing solution. The modeling language allows an integrated specification of the domain architecture and the autonomic fault-mitigation behaviors. The choice of a state-machine-based formalism for specifying behaviors was motivated by

Acknowledgments

This work was supported by NSF under the ITR grant ACI-0121658. The authors also acknowledge the contribution of other RTES collaboration team members at Fermilab, UIUC, and Pittsburgh and Syracuse Universities.


References (22)

  • Harel, D., 1987. Statecharts: a visual formalism for complex systems. Science of Computer Programming.
  • Avizienis, A., 1997. Toward systematic design of fault-tolerant systems. IEEE Computer.
  • Avizienis, A., Avizienis, R., 2001. An immune system paradigm for the design of fault-tolerant systems. Workshop for...
  • Bapty, T., et al., 2000. Model-integrated tools for the design of dynamically reconfigurable systems. VLSI System.
  • Butler, J., et al., 2001. Fault tolerant issues in the BTeV trigger. The Future of Particle Physics.
  • Campos, S., et al. Real-time symbolic model checking for discrete time models.
  • Fussel, D., et al., 2000. Hierarchical motor diagnosis utilizing structural knowledge and a self-learning neuro-fuzzy scheme. IEEE Transactions on Industrial Electronics.
  • Holzmann, G.J., 1997. The model checker SPIN. IEEE Transactions on Software Engineering.
  • Holzmann, G.J., 2003. The SPIN Model Checker: Primer and Reference Manual.
  • IBM Autonomic Research Webpage...
  • Kalbarczyk, Z.T., et al., 1999. Chameleon: a software infrastructure for adaptive fault tolerance. IEEE Transactions on Parallel and Distributed Systems.

Dr. Sandeep Neema is a Research Assistant Professor of Electrical Engineering and Computer Science at the Institute for Software Integrated Systems, Vanderbilt University. His research interests include dynamic adaptation for QoS assurance in distributed real-time embedded systems, model-based design of embedded systems, aspect-oriented program composition techniques, design space exploration and constraint-based synthesis of embedded systems, and fault tolerance in large-scale computing clusters. He received his Ph.D. from Vanderbilt University in 2001.

Dr. Ted Bapty is a Senior Research Scientist and Research Assistant Professor at the Institute for Software Integrated Systems, Vanderbilt University. His research interests include model-based systems, hardware/software co-design, fault-tolerant systems, program composition, embedded high-performance computing, aspect-oriented program composition, and real-time systems. He received his Ph.D. from Vanderbilt University in 1995.

Shweta Shetty is a Staff Engineer at the Institute for Software Integrated Systems, Vanderbilt University. Her research interests include model-based design for embedded systems, model-driven architecture, fault-tolerant systems, and distributed real-time systems. She received her Master's from Vanderbilt University in 2004.

Steven Nordstrom is a graduate student at the Institute for Software Integrated Systems, Vanderbilt University. His research interests include hardware/software co-design, fault-tolerant systems, and real-time embedded systems. He received his Master's from Vanderbilt University in 2003.
