Abstract
Modern systems have complex memory hierarchies and heterogeneous cores and processors, which makes efficient programming challenging. An easy-to-understand performance model that offers guidelines and information about the behaviour of a code can help alleviate these issues. In this paper, we present two extensions of the well-known Berkeley Roofline Model. The first, the Dynamic Roofline Model (DyRM), takes into account the complexities of multicore and heterogeneous systems and offers a more detailed view of how the execution of a code evolves. The second, the 3DyRM, adds information about the latency of memory accesses to better represent behaviour on systems with complex memory hierarchies. A set of tools to obtain and represent the models has been implemented. These tools obtain the required data from hardware counters with low overhead, and display different views from which the main features of a code can be extracted. Results obtained with these tools for the OpenMP version of the NAS Parallel Benchmarks on two different systems are presented.
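As a rough illustration of the ideas summarised above, the following minimal sketch assumes that hardware counters are sampled over successive intervals and that each sample yields floating-point operations, bytes moved, and a mean memory-access latency. The `Sample` fields and the function names are hypothetical and are not taken from the tools described here; only the `min(peak compute, bandwidth × intensity)` bound is the standard Roofline formulation.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One hypothetical sampling interval of hardware-counter readings."""
    seconds: float       # length of the interval
    flops: float         # floating-point operations counted in the interval
    bytes_moved: float   # bytes transferred to or from memory in the interval
    mean_latency: float  # mean memory-access latency in cycles (third axis)


def roofline_bound(intensity, peak_gflops, peak_gbs):
    """Classic Roofline bound: min(peak compute, bandwidth * intensity)."""
    return min(peak_gflops, peak_gbs * intensity)


def dynamic_points(samples):
    """Map each interval to a (flops/byte, GFLOP/s, latency) point.

    Plotting the first two coordinates over time gives a per-interval
    Roofline view; the third coordinate adds the latency dimension.
    Intervals without memory traffic or with zero length are skipped.
    """
    points = []
    for s in samples:
        if s.bytes_moved <= 0 or s.seconds <= 0:
            continue
        intensity = s.flops / s.bytes_moved
        gflops = s.flops / s.seconds / 1e9
        points.append((intensity, gflops, s.mean_latency))
    return points
```

Each returned point can be compared against `roofline_bound` for an assumed machine peak to see whether a given phase of the execution is closer to the bandwidth limit or to the compute limit.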







Acknowledgments
This work has been partially supported by the Ministry of Education and Science of Spain, FEDER funds under contract TIN 2010-17541, and Xunta de Galicia, EM2013/041. It has been developed in the framework of the European network HiPEAC-2 and the Spanish network CAPAP-H4 (TIN2011-15734-E).
Cite this article
Lorenzo, O.G., Pena, T.F., Cabaleiro, J.C. et al. 3DyRM: a dynamic roofline model including memory latency information. J Supercomput 70, 696–708 (2014). https://doi.org/10.1007/s11227-014-1163-4
DOI: https://doi.org/10.1007/s11227-014-1163-4