Performance Tuning of x86 OpenMP Codes with MAQAO

Barthou, Denis; Charif Rubial, Andres; Jalby, William; Koliai, Souad; Valensi, Cédric

doi:10.1007/978-3-642-11261-4_7

Abstract

Failing to find the best optimization sequence for a given application can leave the compiler generating code that performs poorly or is otherwise inappropriate. Improving on the compilation process therefore requires analyzing performance at the level of the generated assembly code. This paper presents MAQAO, a tool for the performance analysis of multithreaded codes (currently, OpenMP programs are supported). MAQAO relies on static performance evaluation to identify compiler optimizations and assess the performance of loops. It uses static binary rewriting to read and instrument object files or executables. Static binary instrumentation allows probes to be inserted at the instruction level. Memory accesses can be captured to help tune the code, but such traces need to be compressed. MAQAO can analyze the results and provide hints for tuning the code. We show on some examples how this can help users improve their OpenMP applications.
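
To illustrate why memory access traces captured from loops lend themselves to compression, the following is a minimal sketch of stride-based run-length encoding of an address trace. It is not the scheme MAQAO uses, which the abstract does not specify; the record layout `(base, stride, count)` and the function names `compress_trace` and `expand_runs` are assumptions made only for this example.

```python
from typing import List, Tuple

# Hypothetical record layout: (base address, stride in bytes, repetition count).
Run = Tuple[int, int, int]


def compress_trace(addresses: List[int]) -> List[Run]:
    """Greedily group consecutive addresses that share a constant stride."""
    runs: List[Run] = []
    i = 0
    n = len(addresses)
    while i < n:
        base = addresses[i]
        if i + 1 == n:
            # A single trailing address forms a run of length one.
            runs.append((base, 0, 1))
            break
        stride = addresses[i + 1] - base
        count = 2
        # Extend the run while the next address keeps the same stride.
        while i + count < n and addresses[i + count] - addresses[i + count - 1] == stride:
            count += 1
        runs.append((base, stride, count))
        i += count
    return runs


def expand_runs(runs: List[Run]) -> List[int]:
    """Rebuild the original trace from its runs (the encoding is lossless)."""
    return [base + k * stride for base, stride, count in runs for k in range(count)]


# A loop reading 1000 consecutive 8-byte elements, then one unrelated access.
trace = [0x1000 + 8 * k for k in range(1000)] + [0x9000]
runs = compress_trace(trace)
```

In this sketch, the 1001 recorded addresses collapse into two records, and `expand_runs(runs)` reproduces the original trace exactly. Irregular access patterns compress far less well under such a simple scheme.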

Author information

Authors and Affiliations

LaBRI/INRIA, University of Bordeaux, Bordeaux, France
Denis Barthou

Editor information

Editors and Affiliations

Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH), TU Dresden, Dresden, 01062, Germany
Matthias S. Müller
Höchstleistungsrechenzentrum (HLRS), Universität Stuttgart, Nobelstr. 19, Stuttgart, 70569, Germany
Michael M. Resch
Höchstleistungsrechenzentrum (HLRS), Universität Stuttgart, Nobelstr. 19, Stuttgart, 70569, Germany
Alexander Schulz
Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH), TU Dresden, Dresden, 01062, Germany
Wolfgang E. Nagel

About this paper

Cite this paper

Barthou, D., Charif Rubial, A., Jalby, W., Koliai, S., Valensi, C. (2010). Performance Tuning of x86 OpenMP Codes with MAQAO. In: Müller, M., Resch, M., Schulz, A., Nagel, W. (eds) Tools for High Performance Computing 2009. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11261-4_7

DOI: https://doi.org/10.1007/978-3-642-11261-4_7
Published: 27 May 2010
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-11260-7
Online ISBN: 978-3-642-11261-4
eBook Packages: Computer Science, Computer Science (R0)
