Abstract
A high performance computer (HPC) is a large, complex system whose architecture design faces increasing difficulties and risks. Traditional methods, such as theoretical analysis, component-level simulation and sequential simulation, are not applicable to system-level simulations of HPC systems. Even parallel simulation on large-scale parallel machines faces many difficulties in scalability, reliability, generality and efficiency. To address the current needs of HPC architecture design, this paper proposes a system-level parallel simulation platform, ArchSim. We first introduce the architecture of the ArchSim simulation platform, which is composed of a global server (GS), local server agents (LSA) and entities. Second, we describe some key techniques of ArchSim, including the synchronization protocol, the communication mechanism and the distributed checkpointing/restart mechanism. We then run a comprehensive test of the main performance indices of ArchSim with the phold benchmark and analyze the extra overhead generated by ArchSim. Finally, based on ArchSim, we construct a parallel event-driven interconnection network simulator and a system-level simulator for a small-scale HPC system with 256 processors. The results of the performance test and the HPC system simulations demonstrate that ArchSim can achieve a high speedup ratio and high scalability on a parallel host machine and can support system-level simulations for the architecture design of HPC systems.
Additional information
This work is supported by the National High Technology Research and Development 863 Program of China under Grant No. 2007AA01Z117, and the National Basic Research 973 Program of China under Grant No. 2007CB310900.
About this article
Cite this article
Huang, YQ., Li, HL., Xie, XH. et al. ArchSim: A System-Level Parallel Simulation Platform for the Architecture Design of High Performance Computer. J. Comput. Sci. Technol. 24, 901–912 (2009). https://doi.org/10.1007/s11390-009-9281-9
DOI: https://doi.org/10.1007/s11390-009-9281-9