ABSTRACT
This paper systematically compares characteristics of computationally intensive workloads. Our analysis focuses on standard HPC benchmarks and representative applications. For the selected workloads, we provide a wide range of characterizations based on instruction tracing and hardware counter measurements.
Each workload is analyzed at the instruction level by comparing the dynamic distribution of executed instructions. We also analyze memory access patterns, including various aspects of cache utilization and the locality properties of address distributions. Since prefetching plays an important role in the performance of computational workloads, we explore the prefetching potential of each workload. For parallel workloads, we also study the sharing properties of memory accesses. For completeness, the HPC workloads are compared to two commonly used commercial computing benchmarks.
The results of this work show that the HPC application space is surprisingly diverse, with some codes showing data sharing and locality properties similar to those of commercial applications. The wide range of studies presented in this paper is instrumental in uncovering the diversity of this application space.
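As an illustration of the kind of locality analysis mentioned above, the following minimal Python sketch computes reuse (LRU stack) distances over a memory address trace. It is not the tooling used in this paper. The function names `reuse_distances` and `lru_hit_ratio`, the 64-byte line size, and the toy trace are all assumptions made for illustration.

```python
def reuse_distances(trace, line_size=64):
    """Return the reuse (LRU stack) distance of each access in `trace`.

    The distance is the number of distinct cache lines touched since the
    previous access to the same line. None marks a first-time (cold) access.
    `line_size` is an assumed cache-line size in bytes.
    """
    stack = []       # cache lines ordered from least to most recently used
    distances = []
    for addr in trace:
        line = addr // line_size
        if line in stack:
            pos = stack.index(line)
            # Lines above `pos` were touched after the last access to `line`.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(line)
    return distances


def lru_hit_ratio(distances, capacity_lines):
    """Hit ratio of a fully associative LRU cache holding `capacity_lines` lines.

    An access hits exactly when its reuse distance is smaller than the capacity.
    """
    if not distances:
        return 0.0
    hits = sum(1 for d in distances if d is not None and d < capacity_lines)
    return hits / len(distances)


if __name__ == "__main__":
    # Hypothetical byte-address trace, used only to exercise the functions.
    toy_trace = [0, 64, 128, 0, 64, 64, 256, 0]
    dists = reuse_distances(toy_trace)
    print(dists)
    print(lru_hit_ratio(dists, capacity_lines=2))
```

The list-based stack keeps the sketch short but costs time proportional to the number of distinct lines per access. A tree-based structure would be the usual choice for real traces.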
Index Terms
- Characteristics of workloads used in high performance and technical computing