Abstract
We present RASE, a full-system, high-performance simulation methodology for complex server applications running on server-class chip multiprocessors with fine-grain multithreading (CMTs). RASE combines application knowledge, operating system information, and data access patterns with an instruction stream from a highly tuned, scalable, steady-state benchmark [5] [22] to generate multiple representative instruction streams that can be mapped to a variety of CMT configurations. We use execution-driven simulation to generate instruction streams for M processors and store them as instruction trace files (several billion instructions per processor), which can then be post-processed and augmented to simulate systems with more than M processors. We use SPEC JBB2000, TPC-C, and an XML server benchmark to compare the performance estimates of RASE against a reference prototype CMT system. Varying M, we find that our trace-driven simulation methodology predicts the instructions per cycle (IPC) of the reference hardware to within 5% for these applications. Without post-processing the traces, prediction accuracy degrades, in the best cases, to 20-40% of the real IPC for instruction traces that require a high replication factor.
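The two quantities the abstract relies on, the replication factor needed to stretch M traces over a larger system and the IPC prediction error against the reference hardware, can be sketched as follows. This is a minimal illustration, not the RASE implementation: the function names, the round-robin assignment of traces to hardware threads, and the relative-error definition are all assumptions made for this example.

```python
import math


def replication_factor(target_threads: int, traced_processors: int) -> int:
    """Number of times each of the M traces is reused to fill the target system.

    Assumption: traces are reused evenly, so the factor is ceil(N / M).
    """
    if target_threads <= 0 or traced_processors <= 0:
        raise ValueError("thread and processor counts must be positive")
    return math.ceil(target_threads / traced_processors)


def assign_traces(target_threads: int, traced_processors: int) -> list[int]:
    """Map each simulated hardware thread to a trace index.

    Assumption: a simple round-robin mapping (thread t replays trace t mod M).
    """
    if target_threads <= 0 or traced_processors <= 0:
        raise ValueError("thread and processor counts must be positive")
    return [t % traced_processors for t in range(target_threads)]


def relative_ipc_error(predicted_ipc: float, measured_ipc: float) -> float:
    """Relative error of the simulated IPC against the hardware IPC."""
    if measured_ipc <= 0:
        raise ValueError("measured IPC must be positive")
    return abs(predicted_ipc - measured_ipc) / measured_ipc


if __name__ == "__main__":
    # Hypothetical configuration and IPC values, chosen only for illustration.
    print(replication_factor(32, 4))
    print(assign_traces(6, 4))
    print(f"{relative_ipc_error(0.95, 1.00):.1%}")
```

Under these assumptions, a high replication factor means many simulated threads replay identical instruction and address streams, which is the case where the abstract reports that unprocessed traces lose accuracy.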
- K. Aingaran, P. Kongetira, et al., "A 32-way Multithread SPARC® Processor," 16th Hot Chips Symposium, Aug. 2004.
- A. R. Alameldeen, C. J. Mauer, et al., "Evaluating Non-deterministic Multi-threaded Commercial Workloads," Computer Architecture Evaluation using Commercial Workloads (CAECW), Feb. 2002.
- A. R. Alameldeen and D. A. Wood, "Variability in Architectural Simulations of Multi-threaded Workloads," 9th Int'l Symp. on High Performance Computer Architecture (HPCA), Feb. 2003.
- A. R. Alameldeen, M. M. K. Martin, et al., "Simulating a $2M commercial server on a $2K PC," Computer, Vol. 36, Issue 2, Feb. 2003, pp. 50-57.
- M. Annavaram, et al., "The Fuzzy Correlation between Code and Performance Predictability," MICRO-37, Dec. 2004.
- L. Barroso, K. Gharachorloo, and E. Bugnion, "Memory System Characterization of Commercial Workloads," Proc. of the 25th Annual Int'l Symp. on Computer Architecture (ISCA), June 1998, pp. 3-14.
- L. Barroso, K. Gharachorloo, R. McNamara, et al., "Piranha: a scalable architecture based on single-chip multiprocessing," ISCA-27, June 2000, pp. 282-293.
- S. Basu, S. Roy, R. Kumar, T. Fisher, B. E. Blaho, "Peppermint and Sled: tools for evaluating SMP systems based on IA-64 (IPF) processors," Proc. of the International Parallel and Distributed Processing Symposium (IPDPS 2002), 15-19 April 2002, pp. 54-63.
- R. Bedichek, "SimNow™: Fast Platform Simulation Purely in Software," 16th Hot Chips Symp., Aug. 2004.
- J. D. Davis, J. Laudon, and K. Olukotun, "Maximizing CMP Throughput with Mediocre Cores," Int'l Conference on Parallel Architectures and Compilation Techniques (PACT), Sept. 2005, pp. 51-62.
- F. Eskesen, et al., "Performance Analysis of Simultaneous Multithreading in a PowerPC-based Processor," IBM Research Report RC22454, May 2002.
- J. Gibson, R. Kunz, et al., "FLASH vs. (Simulated) FLASH: Closing the Simulation Loop," Proc. of the 9th Int'l Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Nov. 2000, pp. 49-58.
- J. Huh, S. W. Keckler, and D. Burger, "Exploring the Design Space of Future CMPs," PACT, Sept. 2001, pp. 199-210.
- H. Khalid, "Validating trace-driven microarchitectural simulations," IEEE Micro, Vol. 20, Issue 6, Nov.-Dec. 2000, pp. 76-82.
- S. Kunkel, B. Armstrong, P. Vitale, "System optimization for OLTP workloads," IEEE Micro, Vol. 19, Issue 3, May-June 1999, pp. 56-64.
- S. Kunkel, et al., "A performance methodology for commercial servers," IBM Journal of Research and Development, Vol. 44, No. 6, 2000.
- J. Laudon, A. Gupta, and M. Horowitz, "Interleaving: A Multithreading Technique Targeting Multiprocessors and Workstations," Proc. of the 6th Int'l Symp. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Oct. 1994, pp. 308-318.
- J. Lo, L. Barroso, S. Eggers, K. Gharachorloo, et al., "An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors," ISCA-25, June 1998, pp. 39-50.
- P. Magnusson, M. Christensson, et al., "Simics: A Full System Simulation Platform," Computer, Feb. 2002, pp. 50-58.
- P. Ranganathan, K. Gharachorloo, et al., "Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors," ASPLOS-8, Oct. 1998, pp. 307-318.
- T. Sherwood, S. Sair, and B. Calder, "Phase Tracking and Prediction," ISCA-30, June 2003.
- L. Spracklen and S. Abraham, "Chip Multithreading: Opportunities and Challenges," HPCA-11, Feb. 2005.
- R. Stets, L. A. Barroso, et al., "A Detailed Comparison of TPC-C versus TPC-B," Third Workshop on CAECW, Jan. 2000.
- Standard Performance Evaluation Corporation, SPEC*, http://www.spec.org, Warrenton, VA.
- Sun Microsystems Inc., "XML Processing Performance in Java and .Net," http://java.sun.com/performance/reference/whitepapers/XML_Test-1_0.pdf
- Transaction Processing Performance Council, TPC-*, http://www.tpc.org, San Francisco, CA.
- R. E. Wunderlich, T. F. Wenisch, B. Falsafi, J. C. Hoe, "SMARTS: Accelerating Microarchitecture Simulation via Rigorous Statistical Sampling," ISCA-30, June 2003.
- X. Qin and J. L. Baer, "A comparative study of conservative and optimistic trace-driven simulations," Proc. of the 28th Annual Simulation Symposium, 9-13 April 1995, pp. 42-50.
Index Terms
- The RASE (Rapid, Accurate Simulation Environment) for chip multiprocessors