Abstract
Failing to find the best optimization sequence for a given application can lead the compiler to generate code that performs poorly or is otherwise inappropriate. Analyzing performance at the level of the generated assembly code is therefore necessary to improve on the compilation process. This paper presents MAQAO, a tool for the performance analysis of multithreaded codes (currently, OpenMP programs are supported). MAQAO relies on static performance evaluation to identify compiler optimizations and to assess the performance of loops. It uses static binary rewriting to read and instrument object files or executables. Static binary instrumentation allows probes to be inserted at the instruction level. Memory accesses can be captured to help tune the code, but such traces need to be compressed. MAQAO can analyze the results and provide hints for tuning the code. We show on several examples how this can help users improve their OpenMP applications.
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Barthou, D., Charif Rubial, A., Jalby, W., Koliai, S., Valensi, C. (2010). Performance Tuning of x86 OpenMP Codes with MAQAO. In: Müller, M., Resch, M., Schulz, A., Nagel, W. (eds) Tools for High Performance Computing 2009. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11261-4_7
DOI: https://doi.org/10.1007/978-3-642-11261-4_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-11260-7
Online ISBN: 978-3-642-11261-4
eBook Packages: Computer Science, Computer Science (R0)