The Secrets of the Accelerators Unveiled: Tracing Heterogeneous Executions Through OMPT

Llort, Germán; Filgueras, Antonio; Jiménez-González, Daniel; Servat, Harald; Teruel, Xavier; Mercadal, Estanislao; Álvarez, Carlos; Giménez, Judit; Martorell, Xavier; Ayguadé, Eduard; Labarta, Jesús

doi:10.1007/978-3-319-45550-1_16

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 9903)

Included in the following conference series:

International Workshop on OpenMP

Abstract

Heterogeneous systems are an important trend in the future of supercomputers, yet they can be hard to program, and developers still lack powerful tools to understand how well their accelerated codes perform and how to improve them.

Different types of hardware accelerators are available, each programmed through its own low-level API, and there is not yet a clear consensus on a standard way to retrieve information about an accelerator’s performance. To improve this situation, OMPT is a novel performance monitoring interface that is being considered for integration into the OpenMP standard. OMPT allows analysis tools to monitor the execution of parallel OpenMP applications by providing detailed information about the activity of the runtime through a standard API. For accelerator devices, OMPT also facilitates the exchange of performance information between the runtime and the analysis tool. We implement the part of the OMPT specification that covers accelerators in both the Nanos++ parallel runtime system and the Extrae tracing framework, obtaining detailed performance information about the execution of the tasks issued to the accelerator devices for later analysis.

Our work extends previous efforts to expose detailed information from the OpenMP and OmpSs runtimes about the activity and performance of task-based parallel applications. In this paper, we focus on the evaluation of FPGA devices, studying the performance of two kernels common in scientific algorithms: matrix multiplication and Cholesky decomposition. Furthermore, this development is seamlessly applicable to the analysis of GPGPU accelerators and Intel® Xeon Phi™ coprocessors operating under the OmpSs programming model.
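The callback-based monitoring described in the abstract can be illustrated with a minimal sketch. It is a toy model, not the OMPT API and not the Nanos++ or Extrae implementation: the classes `ToyRuntime` and `ToyTracer`, the event names, and the logical clock are all hypothetical names introduced here for illustration. The sketch only shows the general pattern in which a tool registers callbacks with a runtime, receives begin and end notifications for tasks issued to a device, and derives a simple per-device metric from the recorded events.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical event names; they are NOT taken from the OMPT specification.
TASK_BEGIN = "device_task_begin"
TASK_END = "device_task_end"

# Callback signature assumed for this sketch: (device_id, task_name, timestamp)
Callback = Callable[[int, str, int], None]


class ToyRuntime:
    """Stand-in for a parallel runtime that notifies a tool about device tasks."""

    def __init__(self) -> None:
        self._callbacks: Dict[str, List[Callback]] = {}
        self._clock = 0  # logical clock keeps the example deterministic

    def set_callback(self, event: str, callback: Callback) -> None:
        """Let a tool subscribe to one kind of runtime event."""
        self._callbacks.setdefault(event, []).append(callback)

    def _notify(self, event: str, device_id: int, task_name: str) -> None:
        for callback in self._callbacks.get(event, []):
            callback(device_id, task_name, self._clock)

    def run_on_device(self, device_id: int, task_name: str, duration: int) -> None:
        """Pretend to run a task on a device, emitting begin and end events."""
        self._notify(TASK_BEGIN, device_id, task_name)
        self._clock += duration  # simulated execution time
        self._notify(TASK_END, device_id, task_name)


class ToyTracer:
    """Stand-in for a tracing tool that records the events it is notified of."""

    def __init__(self, runtime: ToyRuntime) -> None:
        # Each record is (timestamp, device_id, event, task_name).
        self.records: List[Tuple[int, int, str, str]] = []
        runtime.set_callback(TASK_BEGIN, self._on_begin)
        runtime.set_callback(TASK_END, self._on_end)

    def _on_begin(self, device_id: int, task_name: str, timestamp: int) -> None:
        self.records.append((timestamp, device_id, TASK_BEGIN, task_name))

    def _on_end(self, device_id: int, task_name: str, timestamp: int) -> None:
        self.records.append((timestamp, device_id, TASK_END, task_name))

    def busy_time(self) -> Dict[int, int]:
        """Sum the time between matching begin and end events per device."""
        started: Dict[Tuple[int, str], int] = {}
        busy: Dict[int, int] = {}
        for timestamp, device_id, event, task_name in self.records:
            key = (device_id, task_name)
            if event == TASK_BEGIN:
                started[key] = timestamp
            else:
                busy[device_id] = busy.get(device_id, 0) + timestamp - started.pop(key)
        return busy


runtime = ToyRuntime()
tracer = ToyTracer(runtime)
runtime.run_on_device(0, "matmul_block", 5)
runtime.run_on_device(0, "cholesky_block", 3)
runtime.run_on_device(1, "matmul_block", 4)
print(tracer.busy_time())
```

In this toy model the runtime owns the clock and the tool only observes events, which loosely mirrors the division of roles the abstract describes between a runtime and an analysis tool. A real tool would use the interface defined by the OpenMP specification rather than these illustrative classes.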

Notes

1. At the time of writing, the OMPT specification has gone through a major simplification. Because of the large number of changes in the latest version, our implementation is based on a hybrid that combines an earlier specification with the latest target specification. As a result, the implementation we propose is a prototype and cannot be considered definitive; it is rather an approach that shows how performance tools can take advantage of the OMPT specification to capture accelerator activity.

Acknowledgments

This work was partially supported by the European Union H2020 program through the AXIOM project (grant ICT-01-2014 GA 645496) and the Mont-Blanc 2 project; by the Ministerio de Economía y Competitividad, under contracts Computación de Altas Prestaciones VII (TIN2015-65316-P); Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya, under projects MPEXPAR: Models de Programació i Entorns d’Execució Paral·lels (2014-SGR-1051) and 2009-SGR-980; the BSC-CNS Severo Ochoa program (SEV-2011-00067); the Intel-BSC Exascale Laboratory project; and the OMPT Working Group.

Author information

Authors and Affiliations

Department of Computer Sciences, Barcelona Supercomputing Center, Barcelona, Spain
Germán Llort, Antonio Filgueras, Daniel Jiménez-González, Xavier Teruel, Estanislao Mercadal, Carlos Álvarez, Judit Giménez, Xavier Martorell, Eduard Ayguadé & Jesús Labarta
Department of Computer Architecture, Polytechnic University of Catalonia-BarcelonaTech, Barcelona, Spain
Germán Llort, Antonio Filgueras, Daniel Jiménez-González, Xavier Teruel, Estanislao Mercadal, Carlos Álvarez, Judit Giménez, Xavier Martorell, Eduard Ayguadé & Jesús Labarta
Intel Corporation Iberia, Torre Picasso, 25th Floor, Plaza Pablo Ruiz Picasso, 1, 28020, Madrid, Spain
Harald Servat

Corresponding author

Correspondence to Germán Llort.

Editor information

Editors and Affiliations

RIKEN AICS, Kobe, Japan
Naoya Maruyama
Lawrence Livermore National Laboratory, Livermore, California, USA
Bronis R. de Supinski
RIKEN AICS, Kobe, Japan
Mohamed Wahib

Cite this paper

Llort, G. et al. (2016). The Secrets of the Accelerators Unveiled: Tracing Heterogeneous Executions Through OMPT. In: Maruyama, N., de Supinski, B., Wahib, M. (eds) OpenMP: Memory, Devices, and Tasks. IWOMP 2016. Lecture Notes in Computer Science, vol 9903. Springer, Cham. https://doi.org/10.1007/978-3-319-45550-1_16

DOI: https://doi.org/10.1007/978-3-319-45550-1_16
Published: 21 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45549-5
Online ISBN: 978-3-319-45550-1
eBook Packages: Computer Science, Computer Science (R0)