Performability analysis of guarded-operation duration: a translation approach for reward model solutions

https://doi.org/10.1016/S0166-5316(03)00138-X

Abstract

Performability measures are often defined for analyzing the worth of fault-tolerant systems whose performance is gracefully degradable. Accordingly, performability evaluation is inherently well suited for application of reward model solution techniques. On the other hand, the complexity of performability evaluation for solving engineering problems may prevent us from utilizing those techniques directly, suggesting the need for approaches that would enable us to exploit reward model solution techniques through problem transformation. In this paper, we present a performability modeling effort that analyzes the guarded-operation duration for onboard software upgrading. More specifically, we define a “performability index” Y that quantifies the extent to which the guarded operation with a duration φ reduces the expected total performance degradation. In order to solve for Y, we progressively translate its formulation until it becomes an aggregate of constituent measures conducive to efficient reward model solutions. Based on the reward-mapping-enabled intermediate model, we specify reward structures in the composite base model which is built on three stochastic activity network reward models. We describe the model-translation approach and show its feasibility for design-oriented performability modeling.

Introduction

In order to protect an evolvable, distributed embedded system for long-life missions against the adverse effects of design faults introduced by an onboard software upgrade, a methodology called guarded software upgrading (GSU) has been developed [1], [2], [3]. The GSU methodology is supported by a message-driven confidence-driven (MDCD) protocol that enables effective and efficient use of checkpointing and acceptance test techniques for error containment and recovery. More specifically, the MDCD protocol is responsible for ensuring that the system functions properly after a software component is replaced by an updated version during a mission, while allowing the updated component to interact freely with other components in the system. The period during which the system is under the escort of the MDCD protocol is called “guarded operation”.

Guarded operation thus permits an upgraded software component to begin serving the mission seamlessly; if the escorting process determines that the upgraded component is not sufficiently reliable and thus poses an unacceptable risk to the mission, it ensures that the system is safely downgraded by replacing the upgraded component with an earlier version. It is anticipated that sensible use of this escorting process will minimize the expected total performance degradation, which comprises: (1) the performance penalty due to design-fault-caused failure, and (2) the performance reduction due to the overhead of the safeguard activities. Accordingly, an important design parameter is the duration of the guarded operation φ, as the total performance degradation is directly influenced by the length of the escorting process. In turn, this suggests that a performability analysis [4] is pertinent to the engineering decision-making.
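To fix ideas, the trade-off can be sketched in notation we introduce here for illustration only (the paper's own formulation is developed in Section 3): let E[Wφ] denote the expected total performance degradation when the guarded operation lasts φ. Then

\[
E[W_\varphi] \;=\; E[D_{\mathrm{fail}}(\varphi)] + E[D_{\mathrm{ovh}}(\varphi)],
\]

where the first term captures the penalty due to design-fault-caused failure and the second the overhead of the safeguard activities. Intuitively, lengthening the guarded operation shrinks the first term while inflating the second, so an intermediate value of φ should minimize their sum.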

Performability modeling often implies the need to consider a broad spectrum of system attributes simultaneously and to assess their collective effect on the benefit derived from the system or the worth of a mission the system intends to accomplish. Accordingly, performability evaluation is inherently well suited for the application of: (1) reward model solution techniques (see [5], [6], [7], [8] for example), (2) methods for hierarchical or hybrid composition (see [9], [10] for example) and behavioral decomposition (see [11], [12] for example), and (3) tools that implement those modeling techniques (see [13], [14] for example). On the other hand, the complexity of performability measures for analyzing engineering problems, and the dependencies among the system attributes or subsystems that are subject to a joint consideration, may prevent us from exploiting those techniques in a straightforward fashion. Hence, performability analysis with the motivation described in the preceding paragraph presents us with greater challenges than the separate dependability and performance studies for GSU we conducted earlier [2], [3].

To address the challenges, we propose a model-translation approach that enables us to exploit reward model solution techniques which we would otherwise be unable to utilize. Rather than attempt to map the performability measure directly to a single reward structure in a monolithic model, we transform the problem of solving a complex performability measure into that of evaluating several constituent reward variables, each of which can be easily mapped to a reward structure and thereby evaluated efficiently using any software tool that supports reward model solutions.

In particular, we first define a “performability index” Y, which quantifies the extent to which the guarded operation with a duration φ reduces the expected total performance degradation, relative to the case in which guarded operation is completely absent. For clarity and simplicity of the design-oriented model, we allow Y to be formulated at a high level of abstraction. In order to solve for Y efficiently, we choose not to elaborate its formulation directly or to expand the design-oriented model into a monolithic, state-space-based model. Instead, we translate the model progressively, through analytic manipulation, into an evaluation-oriented form that is an aggregate of constituent measures conducive to reward model solutions. Based on this intermediate, reward-mapping-enabled model, we take the final step of specifying reward structures in the composite base model, which is built on three measure-adaptive stochastic activity network (SAN) [15] reward models.
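Under that description, one plausible formalization of Y, stated here only to fix ideas (the paper's own definition is given in Section 3), is the relative reduction in expected total performance degradation, with E[W0] the expectation when guarded operation is absent and E[Wφ] the expectation under a guarded operation of duration φ:

\[
Y(\varphi) \;=\; \frac{E[W_0] - E[W_\varphi]}{E[W_0]}.
\]

Larger values of Y then indicate that a guarded operation of duration φ eliminates a larger fraction of the degradation that would otherwise be incurred.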

As with behavioral decomposition methods and hierarchical composition techniques, our model-translation approach permits us to avoid dealing with a model that is too complex to allow direct derivation of a closed-form solution. The most important relationship between those previously developed techniques and our approach, however, is that successive model translation is intended to enable the application of techniques for reward model solutions, behavioral decomposition, and hierarchical/hybrid composition to performability modeling problems in which: (1) clear boundaries among “subsystems” or system properties cannot be perceived from the viewpoint of the original problem formulation, or (2) the mathematical implications (for the performability measure) of system behavior may not become apparent until we elaborate the formulation of the problem to a certain degree. More generally, the process of transforming the problem of solving a complex performability measure into that of evaluating constituent reward variables naturally enables us to utilize those existing, efficient modeling techniques and tools that we would be unable to exploit without model translation, widening the scope of their applicability.

The next section provides an overview of the GSU methodology and a description of guarded operation. Section 3 defines and formulates the performability measure. Section 4 explains the translation process in detail, followed by Section 5, which shows how the reward structures are specified in the SAN models. Section 6 presents an analysis of the optimal guarded-operation duration. The paper is concluded in Section 7, which summarizes what we have accomplished.

Section snippets

Review of guarded software upgrading

The development of the GSU methodology was motivated by the challenge of guarding an embedded system against the adverse effects of design faults introduced by onboard software upgrades [1], [3]. The performability study presented in this paper assumes that the underlying embedded system consists of three computing nodes. (This assumption is consistent with the current architecture of the Future Deliveries Testbed at JPL.)

Definition

We define a performability measure that will help us choose the appropriate duration of guarded operation φ. More specifically, φ will be determined based on the value of the performability measure that quantifies the expected total performance degradation reduction resulting from guarded operation.

As mentioned in Section 1, we consider two types of performance degradation, namely

  1. the performance degradation due to design-fault-caused failure, and

  2. the performance degradation caused by the overhead of the safeguard activities.

Translation for reward model solutions

With the motivation described at the end of the previous section, we develop an approach that translates the design-oriented model successively until it reaches a stage at which the final solution of Y becomes a simple function of “constituent measures”, each of which can be directly mapped to a reward structure. Fig. 3 illustrates the process of successive model translation. As shown by the diagram, translation proceeds along two branches: one for solving E[W0] and one for solving E[Wφ] (which
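The shape of this aggregation can be pictured with a short, purely illustrative sketch. In the code below, the two degradation functions are toy placeholders for constituent measures that would, in the actual study, be obtained from reward model solutions of the SAN models; the function names and numerical constants are invented for this example and carry no results from the paper.

    # Illustrative sketch of the final aggregation step only; the two
    # degradation functions are toy stand-ins for constituent measures that
    # would come from reward model solutions, not from these formulas.
    import math

    def e_failure_penalty(phi):
        # Toy stand-in: expected penalty from design-fault-caused failure,
        # assumed to shrink as the guarded period grows.
        return 8.0 * math.exp(-0.5 * phi)

    def e_safeguard_overhead(phi):
        # Toy stand-in: expected overhead of the safeguard activities,
        # assumed to grow with the guarded period.
        return 0.6 * phi

    def performability_index(phi):
        # One plausible reading of Y: the relative reduction of expected
        # degradation versus running with no guarded operation at all.
        e_w0 = e_failure_penalty(0.0)
        e_wphi = e_failure_penalty(phi) + e_safeguard_overhead(phi)
        return (e_w0 - e_wphi) / e_w0

    # Sweep a few candidate durations to see the rise-then-fall trade-off.
    for phi in (1.0, 2.0, 4.0, 8.0):
        print(f"phi={phi}: Y={performability_index(phi):.3f}")

With these placeholder dynamics, Y first rises and then falls as φ grows, which is precisely the kind of trade-off that motivates choosing the guarded-operation duration by evaluation rather than by guesswork.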

SAN reward model solutions for constituent measures

We use SANs to realize the final step of model translation. This choice is based on the following factors: (1) SANs have high-level language constructs that facilitate marking-dependent model specifications and representation of dependencies among system attributes, (2) the UltraSAN tool provides convenient specification capabilities for defining reward structures [13], and (3) by adopting and making minor modifications to the SAN models we developed for our previous (separate)
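For orientation, the kind of reward structure involved can be illustrated with a marking-dependent rate reward written out in ordinary Python; this is not UltraSAN's actual reward-specification syntax, and the place names used below are invented for the illustration.

    # Generic illustration of a marking-dependent rate reward; 'marking'
    # maps (hypothetical) place names to token counts in the current state.
    def degradation_rate(marking):
        if marking.get("failed", 0) > 0:
            return 1.0   # full degradation while a failure remains unrecovered
        if marking.get("guarded", 0) > 0:
            return 0.2   # partial degradation from checkpointing/AT overhead
        return 0.0       # no degradation during normal, unguarded operation

The expected reward accumulated under such a structure over the mission interval then serves as one constituent measure in the aggregate that yields Y.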

Evaluation results

Applying the SAN reward models described in Section 5 and using UltraSAN, we evaluate the performability index Y. Before we proceed to discuss the numerical results, we define the following notation:

    μnew: fault-manifestation rate of the process corresponding to the newly upgraded software version;

    μold: fault-manifestation rate of a process corresponding to an old software version;

    c: coverage of an AT (acceptance test);

    λ: message-sending rate of a process;

    pext: probability that the message a process intends to send is an

Concluding remarks

We have conducted a model-based performability study that analyzes the guarded-operation duration for onboard software upgrading. By translating a design-oriented model into an evaluation-oriented model, we are able to reach a reward model solution for the performability index Y that supports the decision on the duration of the guarded operation.

It is always desirable to directly apply efficient analytic techniques and existing tools for solving modeling problems. In practice, however, there are cases

Acknowledgements

The authors are grateful to the anonymous reviewers, to whom most of the revisions should be credited. The work reported in this paper was supported in part by NASA Small Business Innovation Research (SBIR) contract NAS3-99125.


References (20)

  • A.T. Tai et al., On the effectiveness of a message-driven confidence-driven protocol for guarded software upgrading, Perform. Eval. (2001)
  • A.P.A. van Moorsel et al., Probabilistic evaluation for the analytical solution of large Markov chains: algorithms and tool support, Microelectron. Reliab. (1996)
  • W.H. Sanders et al., The UltraSAN modeling environment, Perform. Eval. (1995)
  • A.T. Tai, K.S. Tso, L. Alkalai, S.N. Chau, W.H. Sanders, On low-cost error containment and recovery methods for guarded...
  • A.T. Tai et al., Low-cost error containment and recovery for onboard guarded software upgrading and beyond, IEEE Trans. Comput. (2002)
  • J.F. Meyer, On evaluating the performability of degradable computing systems, IEEE Trans. Comput. (1980)
  • R.A. Howard, Dynamic Probabilistic Systems, vol. II, Semi-Markov and Decision Processes, Wiley, New York,...
  • W.H. Sanders, J.F. Meyer, A unified approach for specifying measures of performance, dependability, and performability,...
  • G. Ciardo, A. Blackmore, P.F. Chimento, J. Muppala, K.S. Trivedi, Automated generation and analysis of Markov reward...
  • S. Rácz, M. Telek, Performability analysis of Markov reward models with rate and impulse reward, in: M.S.B. Plateau, W....


Ann T. Tai received her Ph.D. in Computer Science from the University of California, Los Angeles. She is the President and a Sr. Scientist of IA Tech, Inc., Los Angeles, CA. Prior to 1997, she was associated with SoHaR Incorporated as a Sr. Research Engineer. She was an Assistant Professor at the University of Texas at Dallas during 1993. Her current research interests concern the design, development, and evaluation of dependable computer systems, error containment and recovery algorithms for distributed computing, and distributed fault-tolerant system architectures. She authored the book, Software Performability: From Concepts to Applications, published by Kluwer Academic Publishers.

William H. Sanders received his B.S.E. in Computer Engineering (1983), his M.S.E. in Computer, Information, and Control Engineering (1985), and his Ph.D. in Computer Science and Engineering (1988) from the University of Michigan. He is currently a Professor in the Department of Electrical and Computer Engineering and the Coordinated Science Laboratory at the University of Illinois. He is Chair of the IEEE TC on Fault-tolerant Computing and Vice-Chair of IFIP Working Group 10.4 on Dependable Computing. In addition, he serves on the Board of Directors of ACM Sigmetrics and the Editorial Board of IEEE Transactions on Reliability. He is a Fellow of the IEEE and a Member of the IEEE Computer, Communications, and Reliability Societies, as well as the ACM, IFIP Working Group 10.4 on Dependable Computing, Sigma Xi, and Eta Kappa Nu.

Dr. Sanders’s research interests include performance/dependability evaluation, dependable computing, and reliable distributed systems. He has published more than 100 technical papers in these areas. He was Co-Program Chair of the 29th International Symposium on Fault-tolerant Computing (FTCS-29), was program Co-Chair of the Sixth IFIP Working Conference on Dependable Computing for Critical Applications, and has served on the program committees of numerous conferences and workshops. He is a co-developer of three tools for assessing the performability of systems represented as stochastic activity networks: METASAN, UltraSAN, and Möbius. UltraSAN has been distributed widely to industry and academia, and licensed to more than 200 universities, several companies, and NASA for evaluating the performance, dependability, and performability of complex distributed systems. He is also a Co-Developer of the Loki distributed system fault injector and the AQuA middleware for providing dependability to distributed object-oriented applications.

Leon Alkalai is the Center Director for the Center for Integrated Space Microsystems, a center of excellence at the Jet Propulsion Laboratory, California Institute of Technology. The main focus of the center is the development of advanced microelectronics, micro-avionics, and advanced computing technologies for future deep-space highly miniaturized, autonomous, and intelligent robotic missions. He joined JPL in 1989 after receiving his Ph.D. in Computer Science from the University of California, Los Angeles. Since then, he has worked on numerous technology development tasks including advanced microelectronics miniaturization, advanced microelectronics packaging, reliable and fault-tolerant architectures. He was also one of the NASA appointed co-leads on the New Millennium Program Integrated Product Development Teams for Microelectronics Systems, a consortium of government, industry, and academia to validate technologies for future NASA missions in the 21st century.

Savio N. Chau received his Ph.D. in Computer Science from the University of California, Los Angeles. He is Principal Engineer and the Supervisor of the Advanced Concepts and Architecture Group at the Jet Propulsion Laboratory. He is currently developing scalable multi-mission avionics system architectures. He has been investigating techniques to apply low-cost commercial bus standards and off-the-shelf products in highly reliable systems such as long-life spacecraft. His research areas include scalable distributed system architecture, fault tolerance, and design-for-testability. He is a Member of Tau Beta Pi and Eta Kappa Nu.

Kam S. Tso received his Ph.D. in Computer Science from the University of California, Los Angeles, M.S. in Electronic Engineering from the Philips International Institute, Eindhoven, The Netherlands, and B.S. in Electronics from the Chinese University of Hong Kong, Hong Kong. From 1986 to 1996, he worked at the Jet Propulsion Laboratory and SoHaR Incorporated, conducting research and development on robotics systems, fault-tolerant systems, and reliable software. He is currently the Vice President of IA Tech, Inc. Dr. Tso’s research interests include World Wide Web technologies, distributed planning and collaboration, high performance and dependable real-time software and systems.

