DOI: 10.1145/1878537.1878695
SpringSim Conference Proceedings
research-article

A replication structure for efficient and fault-tolerant parallel and distributed simulations

Published: 11 April 2010

ABSTRACT

Large-scale parallel and distributed simulations (federations) are developed to study complex systems. Their executions are usually computationally intensive and involve a large number of simulation components (federates), which may be developed by different participants and executed at different locations. Hence, it is attractive to provide mechanisms that accelerate executions and tolerate federate failures. We previously proposed a federate replication structure that improves simulation performance by replicating federates with alternative synchronization approaches and automatically choosing the fastest replica to represent the federate in the federation execution. In this paper, we extend the replication structure so that it retains its performance advantages in the presence of failures. Besides presenting the design and implementation details, we report experimental results demonstrating that the extended replication structure provides fault tolerance while maintaining performance advantages for simulation executions.
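The core idea of the abstract, running several replicas of one federate and letting the fastest surviving replica represent it, can be illustrated with a minimal sketch. This is not the authors' implementation and does not use an HLA RTI; the function `run_replicas`, the replica names, and the use of sleeps to stand in for synchronization overhead are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time


def run_replicas(replicas, step):
    """Run all replicas for one step and return (name, result) of the
    first replica that finishes without raising. Hypothetical helper."""
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = {pool.submit(fn, step): name for name, fn in replicas.items()}
        for fut in as_completed(futures):
            try:
                return futures[fut], fut.result()
            except Exception:
                # A failed replica is skipped; the remaining ones can still win.
                continue
    raise RuntimeError("all replicas failed")


# Three illustrative replicas of the same model. The sleeps are stand-ins
# for the differing cost of each synchronization approach.
def conservative(step):
    time.sleep(0.2)
    return step * step


def optimistic(step):
    time.sleep(0.01)
    return step * step


def crashed(step):
    raise RuntimeError("replica host failed")


replicas = {"conservative": conservative, "optimistic": optimistic, "crashed": crashed}
name, value = run_replicas(replicas, 3)
print(name, value)
```

In this sketch every replica computes the same result, so whichever finishes first can stand in for the federate, and a crashed replica only removes one candidate. Note that leaving the `with` block waits for the slower threads to finish; a real system would more likely let the slower replicas keep running or cancel them.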

