research-article

A replication structure for efficient and fault-tolerant parallel and distributed simulations

Authors:
Zengxiang Li, Nanyang Technological University, Singapore
Wentong Cai, Nanyang Technological University, Singapore
Stephen John Turner, Nanyang Technological University, Singapore
Ke Pan, Nanyang Technological University, Singapore

SpringSim '10: Proceedings of the 2010 Spring Simulation Multiconference, April 2010, Article No.: 151, Pages 1–10, https://doi.org/10.1145/1878537.1878695

Published: 11 April 2010

ABSTRACT

Large-scale parallel and distributed simulations (federations) are developed to study complex systems. Their executions are usually computationally intensive and involve a large number of simulation components (federates), which may be developed by different participants and executed at different locations. It is therefore attractive to provide mechanisms that accelerate these executions and tolerate federate failures. We previously proposed a federate replication structure that improves simulation performance by replicating federates with alternative synchronization approaches and automatically choosing the fastest replica to represent the federate in the federation execution. In this paper, we extend the replication structure so that it retains its performance advantage in the presence of failures. Besides presenting the design and implementation details, we report experimental results demonstrating that the extended replication structure provides fault tolerance while maintaining its performance advantage for simulation executions.
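
The idea of letting the fastest surviving replica represent a federate can be illustrated with a small toy model. The sketch below is not the authors' implementation: the names `Replica`, `cost`, `fails_at_step`, and `run_replicated_federate` are hypothetical, each replica is assumed to take a fixed wall-clock time per step, and a crash is modeled simply as the replica producing no further output.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    """Hypothetical replica of one federate (illustrative only)."""
    name: str
    cost: float                          # assumed wall-clock time per step
    fails_at_step: Optional[int] = None  # step at which the replica crashes, if any

    def finish_time(self, step: int) -> Optional[float]:
        """Wall-clock time at which `step` completes, or None if crashed."""
        if self.fails_at_step is not None and step >= self.fails_at_step:
            return None
        return (step + 1) * self.cost


def run_replicated_federate(replicas: List[Replica], steps: int) -> List[str]:
    """For each step, pick the live replica that finishes that step first."""
    chosen = []
    for step in range(steps):
        finish = {r.name: r.finish_time(step) for r in replicas}
        alive = {name: t for name, t in finish.items() if t is not None}
        if not alive:
            raise RuntimeError(f"all replicas failed before step {step}")
        # The fastest surviving replica represents the federate for this step.
        chosen.append(min(alive, key=alive.get))
    return chosen


if __name__ == "__main__":
    replicas = [
        Replica("optimistic", cost=1.0, fails_at_step=2),
        Replica("conservative", cost=1.5),
    ]
    print(run_replicated_federate(replicas, steps=4))
```

In this toy run the faster replica is selected until it crashes, after which the slower replica takes over without interrupting the execution. This is only meant to convey the combination of performance selection and failover described in the abstract.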

Published in
SpringSim '10: Proceedings of the 2010 Spring Simulation Multiconference
April 2010
1726 pages
ISBN: 9781450300698
General Chairs: Robert McGraw (RAM Laboratories, Inc), Eric Imsand (Auburn University)
Program Chair: Michael J. Chinni (US Army - RDECOM - ARDEC)
Publisher: Society for Computer Simulation International, San Diego, CA, United States

Author Tags
decoupled federate architecture
fault tolerance
federate replication
parallel and distributed simulation
performance enhancement