ABSTRACT
Large-scale parallel and distributed simulations (federations) are developed to study complex systems. Their executions are usually computationally intensive and involve a large number of simulation components (federates), which may be developed by different participants and executed at different locations. It is therefore attractive to provide mechanisms that accelerate executions and tolerate federate failures. Previously, we proposed a federate replication structure that improves simulation performance by replicating federates with alternative synchronization approaches and automatically choosing the fastest replica to represent the federate in the federation execution. In this paper, we extend the replication structure so that it retains its performance advantages in the presence of failures. Besides presenting the design and implementation details, we report experimental results demonstrating that the extended replication structure provides fault tolerance while maintaining its performance advantages for simulation executions.