On the effectiveness of a message-driven confidence-driven protocol for guarded software upgrading

https://doi.org/10.1016/S0166-5316(00)00054-7

Abstract

A methodology called guarded software upgrading (GSU) is proposed to accomplish dependable onboard evolution for long-life deep-space missions. The core of the methodology is a low-cost error containment and recovery protocol that escorts an upgraded software component through onboard validation and guarded operation, mitigating the effect of residual faults in the upgraded component. The message-driven confidence-driven (MDCD) nature of the protocol eliminates the need for costly process coordination or atomic action, yet guarantees that the system will reach a consistent global state upon the completion of the rollback or roll-forward actions carried out by individual processes during error recovery. To validate the ability of the MDCD protocol to enhance system reliability when a software component undergoes onboard upgrading in a realistic, non-ideal environment, we conduct a stochastic activity network model-based analysis. The results confirm the effectiveness of the protocol as originally surmised. Moreover, a comparative study reveals that the dynamic confidence-driven approach is superior to static approaches and is the key to the attainment of cost-effectiveness.

Introduction

For NASA’s future deep-space exploration, new generation onboard computing systems should be able to evolve for long-term survival in unsurveyed space environments. Concepts related to evolvability include hardware reconfigurability and software upgradability [1]. Software upgradability permits spacecraft/science functions to be enhanced, over a mission’s long life span, with respect to performance, accuracy, and dependability.

A challenge that arises from onboard software upgrading is that of guarding the system against failures caused by residual design faults introduced by an upgrade. Experience shows that it is impossible to predict, during ground testing prior to uploading, all possible onboard conditions in an unsurveyed deep-space environment; thus, an upgraded embedded software component can never be guaranteed to have ultra-high reliability. There have been cases in which unprotected software upgrades or evolution caused severe damage to space missions (e.g., see [2], [3]), and the necessity of devising methods for dependable software upgrading was further underscored by MCI WorldCom’s recent 10-day frame relay outage [4]. The outage began on 5 August 1999, four weeks after a scheduled upgrade to new switching software intended to allow the network to handle increased traffic. The incident affected about 15% of MCI WorldCom’s network and 30% of its customers who rely on its high-speed frame relay service.

Although researchers have been investigating dependable system upgrade for critical applications (e.g., see [5], [6], which describe two recent projects), the proposed solutions generally require special effort to develop dedicated system resource redundancy. Due to the severe constraints on the cost, mass, and power consumption of a spacecraft, NASA’s deep-space applications cannot benefit directly from those solutions. Moreover, new generation onboard computing systems such as the X2000, which is being developed at NASA/JPL, employ distributed architectures [1]. Accordingly, error contamination among interacting processes (caused by residual faults in an upgraded software component) becomes a major concern. However, to the best of our knowledge, methods for error containment and recovery in a distributed environment have received little attention in prior work concerning dependable system upgrade (e.g., see [5], [6]) or dynamic program modification (e.g., see [7], [8]).

With the above motivation, we have developed a methodology called guarded software upgrading (GSU) [9]. The methodology is based on a two-stage approach. The first stage is called the onboard validation stage, during which we attempt to build confidence in the new version of a software component through onboard test runs under the real avionics system and environment conditions. The second stage is the guarded operation stage, during which we allow the new version to actually service the mission under the protection of a protocol that intends to mitigate the effect of residual faults in the upgraded component.

Since application-specific techniques are an effective strategy for reducing fault-tolerance cost [10], we exploit the characteristics of our target system and application. To ensure low development cost, we exploit inherent system resource redundancies as the means of fault tolerance. Specifically, we let the old version of the upgraded software component (which is already available to us, and in which we have high confidence due to its long onboard execution time) escort the new version through onboard validation and guarded operation. We also make use of the processor that is active in the encountering/fly-by phases but would otherwise be idle during a cruise phase, when onboard software upgrading takes place (i.e., non-dedicated redundancy), allowing concurrent execution of the new and old versions.

To reduce performance cost, we take a crucial step in devising error containment and recovery methods by introducing the “confidence-driven” notion. This notion complements the message-driven (or “communication-induced”) approach employed by a number of existing checkpointing protocols for tolerating hardware faults. In particular, we discriminate between the individual software components with respect to our confidence in their reliability; moreover, at onboard execution time, we dynamically adjust our confidence in the processes corresponding to those software components, according to the knowledge about potential process state contamination caused by errors in a low-confidence component and message passing. The resulting protocol is thus both message-driven and confidence-driven (MDCD). In [11], we described in detail the error containment and recovery algorithms that constitute the protocol. The main purpose of this paper is to evaluate the effectiveness of the protocol with respect to enhancing system reliability during guarded operation.
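To make the notion concrete, the sketch below illustrates one way a message-driven confidence-driven scheme can operate; the rule choices here are ours, for illustration, and not the verbatim algorithms of [11]. Messages carry the sender's confidence tag, and a clean process checkpoints only when it is about to consume a message that could contaminate its state.

```python
# A minimal sketch (ours, not the verbatim algorithms of [11]) of the
# message-driven confidence-driven idea: messages carry the sender's
# confidence tag, and a clean process checkpoints only when it is about
# to consume a message that could contaminate its state.

from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Message:
    payload: Any
    sender_trusted: bool          # confidence tag piggybacked on the message

@dataclass
class Process:
    name: str
    trusted: bool = True          # our dynamic confidence in this process
    state: Dict[str, Any] = field(default_factory=dict)
    checkpoints: List[Dict[str, Any]] = field(default_factory=list)

    def send(self, payload: Any) -> Message:
        # Outgoing messages reflect the sender's current confidence level.
        return Message(payload, sender_trusted=self.trusted)

    def receive(self, msg: Message) -> None:
        if self.trusted and not msg.sender_trusted:
            # A clean process is about to process a suspect message:
            # checkpoint first so rollback remains possible, then mark
            # the process potentially contaminated.
            self.checkpoints.append(dict(self.state))
            self.trusted = False
        self.state["last"] = msg.payload   # process the message

    def rollback(self) -> None:
        # On error detection, restore the most recent clean checkpoint
        # and regain confidence in the process state.
        if self.checkpoints:
            self.state = self.checkpoints.pop()
            self.trusted = True
```

In this reading, if an escorted process receives a message from the low-confidence process running the new version, it checkpoints once and is thereafter treated as potentially contaminated until recovery restores confidence; no coordination with other processes is required at that moment, which is the source of the protocol's low cost.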

To account for potential process state contamination and message validity, we adapt the notion of “global state consistency” from the literature concerning checkpointing and rollback recovery for hardware faults [12], [13]. Based on the adapted notion, we have developed theorems and formal proofs to show that the MDCD protocol guarantees that the system will reach a consistent global state upon the completion of the rollback or roll-forward actions carried out by individual processes during error recovery [11]. The global state consistency, which is the most fundamental criterion for correct recovery, will also assure that our target system will be failure-free if the MDCD protocol is run in an ideal execution environment where (1) the “old” software components are truly faultless, (2) error conditions in a process state are always manifested in the messages sent by the corresponding process, and (3) the error detection mechanism employed has a perfect coverage.
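For reference, the classical criterion from [12], [13] on which the adapted notion builds can be stated as follows; the adaptation in [11] additionally accounts for potential state contamination and message validity.

```latex
% A global state S = (s_1, ..., s_n) is consistent iff it records no
% orphan message: every message whose receipt appears in a receiver's
% local state also appears as sent in the sender's local state.
\[
\forall\, i, j:\quad \mathrm{recv}_{i \rightarrow j}(s_j) \;\subseteq\; \mathrm{sent}_{i \rightarrow j}(s_i)
\]
```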

On the other hand, as with any other fault-tolerance scheme, the realistic goal of the MDCD protocol is to significantly reduce system failure probability rather than to assure that the system is failure-free, since the ideal execution environment rarely exists in reality. Accordingly, we are motivated to validate, through probabilistic modeling, the protocol’s effectiveness in terms of reliability improvement when the criteria for the ideal execution environment are not satisfied. Accomplishing this goal requires a model that captures numerous interdependencies among system attributes; we therefore choose stochastic activity networks (SANs) [14] for the analysis, due to their capability of explicitly representing dependencies among system attributes. The results from the SAN-based evaluation confirm the effectiveness of the protocol as originally surmised. The analysis also provides useful insights into the system behavior resulting from use of the protocol under various conditions in its execution environment. Moreover, we conduct a comparative study to assess two MDCD variants, namely, the MDCD-O and MDCD-P protocols, that are based on static confidence-driven approaches. The assessment results demonstrate that the dynamic confidence-driven approach employed by the MDCD protocol is superior to static approaches and is the key to achieving cost-effectiveness.
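As a back-of-the-envelope companion to the SAN analysis (not the SAN model itself; the parameter values below are assumed purely for illustration), the following Monte Carlo sketch shows the kind of question the analysis answers: how quickly failure probability grows as error-detection coverage departs from the ideal value of 1.

```python
# A toy Monte Carlo companion to the SAN analysis (not the SAN model of
# Section 4; parameter values are assumed purely for illustration). It
# shows how the probability of system failure grows as error-detection
# coverage falls below 1.

import random

def failure_probability(n_trials: int, p_fault: float, coverage: float) -> float:
    """Estimate P(system failure) over one guarded operation phase.

    p_fault:  probability that the upgraded component errs in the phase
    coverage: probability that an error is detected in time to recover
    """
    failures = 0
    for _ in range(n_trials):
        if random.random() < p_fault:         # a residual fault activates
            if random.random() >= coverage:   # the error escapes detection
                failures += 1                 # undetected error -> failure
    return failures / n_trials

if __name__ == "__main__":
    for c in (1.0, 0.99, 0.95, 0.90):
        p = failure_probability(200_000, 0.05, c)
        print(f"coverage = {c:.2f}  ->  P(failure) ~ {p:.5f}")
```

With coverage 1 the guarded system never fails in this toy model, mirroring the failure-free guarantee of the ideal environment; the SAN model additionally captures old-version faults, contamination paths, and recovery behavior, all of which this sketch deliberately omits.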

The remainder of the paper is organized as follows. Section 2 provides an outline of the GSU methodology. Section 3 reviews the MDCD protocol. Section 4 presents a SAN-based analysis that validates the effectiveness of the protocol, followed by Section 5, which compares dynamic and static confidence-driven approaches by assessing two MDCD variants. Section 6 highlights the significance of this effort and discusses our future research.

Section snippets

Overview of GSU framework

The GSU framework is based on the Baseline X2000 First Delivery Architecture [15], which comprises three high-performance computing nodes (each of which has a local DRAM) and multiple subsystem microcontroller nodes that interface with a variety of devices. All nodes are connected by a fault-tolerant bus network that complies with the commercial interface standard IEEE 1394, facilitating reliable onboard distributed computing.

Since a scheduled software upgrade is normally conducted

MDCD protocol

The MDCD error containment and recovery protocol is discussed in detail in [11]. In this section, we review the protocol and its properties to illustrate our motivation for the analyses conducted in Section 4 (Analysis of effectiveness) and Section 5 (Dynamic vs. static confidence-driven approaches).

Objective

As discussed in the previous section, the MDCD protocol guarantees that the system will reach a consistent global state upon the completion of the rollback or roll-forward actions carried out by individual processes during error recovery. It is worth noting that global state consistency can further guarantee that the system will be failure-free if the MDCD protocol is run in an ideal execution environment. By an “ideal execution environment”, we mean an execution environment for the

MDCD variants

Recall that our confidence-driven approach to error containment and recovery is two-tiered: first, we discriminate between software components with respect to our confidence in their reliability, and second, we adjust our confidence in the processes corresponding to those software components during onboard execution. The latter means that we adjudge the “trustworthiness” of a process dynamically, as illustrated by the sketch below.
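Under one plausible reading of “static” (our illustration, not the exact MDCD-O/MDCD-P semantics, which are defined later in this section), freezing the confidence tags reduces the checkpointing decision to an all-or-nothing policy. The sketch contrasts the checkpoint counts a clean receiver incurs over a message trace under the dynamic rule and the two static extremes.

```python
# An illustrative contrast (our reading, not the exact MDCD-O/MDCD-P
# semantics): a static policy freezes confidence for the whole run, so
# checkpointing degenerates to all-or-nothing, while the dynamic rule
# pays only when contamination can actually propagate.

from typing import Sequence

def checkpoints_taken(sender_trusted: Sequence[bool], policy: str) -> int:
    """Count checkpoints a clean receiver takes over a message trace.

    sender_trusted: per-message flag, True if the sender was trusted
    policy: 'dynamic'     -> checkpoint on the first suspect message only
            'pessimistic' -> checkpoint before every message
            'optimistic'  -> never checkpoint
    """
    if policy == "optimistic":
        return 0
    if policy == "pessimistic":
        return len(sender_trusted)
    taken, clean = 0, True
    for trusted in sender_trusted:
        if clean and not trusted:
            taken += 1        # clean state about to be tainted: checkpoint
            clean = False     # thereafter contaminated, no further cost
    return taken

trace = [True, True, False, True, False]   # hypothetical message trace
for p in ("dynamic", "pessimistic", "optimistic"):
    print(p, checkpoints_taken(trace, p))
```

The dynamic rule pays the checkpointing cost only along actual contamination paths, which is the intuition behind the comparative results presented later.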

To further assess the effectiveness of the MDCD protocol, potential variants of the

Summary and future work

We have presented a study of the effectiveness of the MDCD protocol, an error containment and recovery protocol for onboard software upgrading. In order to mitigate the effect of residual faults in an upgraded software component, we introduce the confidence-driven notion, which complements the message-driven approach employed by a number of existing checkpointing protocols for tolerating hardware faults. In particular, we discriminate between the individual software components with respect to

Acknowledgements

The authors are thankful to the anonymous reviewers for their helpful comments. The work reported in this paper was supported in part by NASA Small Business Innovation Research (SBIR) Contract NAS3-99125.


References (19)

  • W.H. Sanders et al., The UltraSAN modeling environment, Perform. Eval. (1995)
  • L. Alkalai et al., Long-life deep-space applications, IEEE Comput. (1998)
  • J.L. Lions, ARIANE 5 Flight 501 Failure, 1996....
  • A. Avižienis, Towards systematic design of fault-tolerant systems, IEEE Comput. (1997)
  • J. Rendleman, MCI WorldCom blames Lucent software for outage, in: PC Week, Ziff-Davis, August 16, 1999....
  • L. Sha, J.B. Goodenough, B. Pollak, Simplex architecture: meeting the challenges of using COTS in high-reliability...
  • D. Powell et al., GUARDS: a generic upgradable architecture for real-time dependable systems, IEEE Trans. Parallel...
  • M.E. Segal, O. Frieder, Dynamic program updating: a software maintenance technique for minimizing software downtime, J....
  • M.E. Segal et al., On-the-fly program modification: systems for dynamic updating, IEEE Software (1993)

Cited by (7)

  • Performability analysis of guarded-operation duration: A translation approach for reward model solutions

    2004, Performance Evaluation
    Citation excerpt:

    On the other hand, the complexity of performability measures for analyzing engineering problems and the dependencies among the system attributes or subsystems that are subject to a joint consideration may prevent us from exploiting those techniques in a straightforward fashion. Hence, performability analysis with the motivation described in the preceding paragraph presents us with greater challenges than the separate dependability and performance studies for GSU we conducted earlier [2,3]. To address the challenges, we propose a model-translation approach that enables us to exploit reward model solution techniques which we would otherwise be unable to utilize.

  • An OBSM method for real time embedded system

    2006, Proceedings - 2006 10th International Conference on Computer Supported Cooperative Work in Design, CSCWD 2006
  • Protecting Distributed Software Upgrades that Involve Message-Passing Interface Changes

    2003, Proceedings - IEEE Computer Society's International Computer Software and Applications Conference
  • Performability analysis of guarded-operation duration: A successive model-translation approach

    2002, Proceedings of the 2002 International Conference on Dependable Systems and Networks
  • Synergistic coordination between software and hardware fault tolerance techniques

    2001, Proceedings of the International Conference on Dependable Systems and Networks

Ann T. Tai received her Ph.D. in computer science from the University of California, Los Angeles, CA. She is the President and a senior scientist of IA Tech, Los Angeles, CA. Prior to 1997, she was associated with SoHaR as a senior research engineer. She was an Assistant Professor at the University of Texas at Dallas during 1993. Her current research interests concern the design, development and evaluation of dependable, affordable and evolvable computing systems, and low-cost error containment and recovery methods. She authored the book, Software Performability: From Concepts to Applications, published by Kluwer Academic Publishers.

Kam S. Tso received his Ph.D. in computer science from the University of California, Los Angeles, CA. From 1986 to 1996, he worked at the Jet Propulsion Laboratory and SoHaR, conducting research and development on robotics systems, fault-tolerant systems, and reliable software. He is currently the Vice President of IA Tech. His research interests include World Wide Web technologies, distributed simulation and collaboration, high performance and dependable real-time software and systems.

Leon Alkalai is the Center Director for the Center for Integrated Space Microsystems, a Center of Excellence at the Jet Propulsion Laboratory, California Institute of Technology. The main focus of the center is the development of advanced microelectronics, micro-avionics, and advanced computing technologies for future deep-space highly miniaturized, autonomous, and intelligent robotic missions. He joined JPL in 1989 after receiving his Ph.D. in computer science from the University of California, Los Angeles, CA. Since then, he has worked on numerous technology development tasks, including advanced microelectronics miniaturization, advanced microelectronics packaging, and reliable and fault-tolerant architectures. He was also one of the NASA-appointed co-leads on the New Millennium Program Integrated Product Development Teams for Microelectronics Systems, a consortium of government, industry, and academia to validate technologies for future NASA missions in the 21st century.

Savio N. Chau is a system engineer at the Jet Propulsion Laboratory. He is currently developing scalable multi-mission avionics system architectures for the X2000 Program. He has been investigating techniques to apply low-cost commercial bus standards and off-the-shelf products in highly reliable systems such as long-life spacecraft. His research areas include scalable distributed system architecture, fault tolerance, and design-for-testability. He received his Ph.D. in computer science from the University of California, Los Angeles, CA. He is a member of Tau Beta Pi and Eta Kappa Nu.

William H. Sanders received his B.S.E. in computer engineering in 1983, his M.S.E. in computer, information, and control engineering in 1985, and his Ph.D. in computer science and engineering in 1988 from the University of Michigan. He is currently a Professor in the Department of Electrical and Computer Engineering and the Coordinated Science Laboratory at the University of Illinois and is a Vice Chair of IFIP Working Group 10.4 on Dependable Computing. In addition, he serves on the editorial board of IEEE Transactions on Reliability. He is a fellow of the IEEE and a member of the IEEE Computer, Communications, and Reliability Societies, as well as the ACM, Sigma Xi, and Eta Kappa Nu. His research interests include performance/dependability evaluation, dependable computing, and reliable distributed systems. He has published more than 75 technical papers in these areas. He was co-Program Chair of the 29th International Symposium on Fault-tolerant Computing (FTCS-29), was program co-Chair of the Sixth IFIP Working Conference on Dependable Computing for Critical Applications, and has served on the program committees of numerous conferences and workshops. He is the developer of two tools for assessing the performability of systems represented as stochastic activity networks: METASAN and UltraSAN. UltraSAN has been distributed widely to industry and academia, and licensed to more than 185 universities, several companies, and NASA for evaluating the performance, dependability, and performability of complex distributed systems.
