A methodology towards automatic performance analysis of parallel applications☆
Introduction
The performance achieved by a parallel application is the result of complex interactions between the hardware and software resources of the system where the application is being executed. The characteristics of the application, e.g., algorithmic structure, input parameters, problem size, influence these interactions by determining how the application exploits the available resources and the allocated processors. In this framework, tuning and debugging the performance of parallel applications become challenging issues [16].
A typical approach to address these issues is experimental, that is, based on instrumenting the application, monitoring its execution and analyzing its performance either on the fly or post-mortem. Many tools have been developed for this purpose (see e.g., [1], [2], [17], [18]). These tools analyze the measurements collected at run-time and provide statistics and diagrams describing the performance of the application and of its activities, e.g., computation, communication, I/O. The major drawback of these tools is that they fail to assist users in mastering the complexity inherent in this analysis.
To overcome this drawback, various methodological approaches have been proposed and tools have been developed out of these approaches with the aim of identifying performance bottlenecks, that is, the code regions, e.g., routines, loops, of the applications that are critical from the performance viewpoint. The Poirot project [8] proposed a tool architecture to automatically diagnose parallel applications using a heuristic classification scheme. The Paradyn Parallel Performance tool [12] dynamically instruments the applications to automate bottleneck detection at run-time. The Paradyn Performance Consultant performs a hierarchical search for bottlenecks and refines this search by using stack sampling [14] and by pruning the search space considering the behavior of the application during previous runs [9]. The Kappa-Pi tool [3] deals with post-mortem automatic performance analysis of message passing applications based on PVM. The analysis of processor utilizations leads to the identification of performance bottlenecks, which are classified by means of a rule-based knowledge system. Aksum [4] automatically performs multiple runs of a parallel application and detects performance bottlenecks by comparing the performance achieved while varying the problem size and the number of allocated processors.
In this paper, we address the analysis of the performance of parallel applications from a methodological viewpoint with the aim of identifying and localizing performance inefficiencies. We define new performance metrics and criteria that highlight the properties of the applications as well as the load imbalance and dissimilarities in the behavior of the allocated processors. These metrics rely on the measurements collected by monitoring the applications at run-time. The integration of this methodology into a performance tool will help users interpret the performance achieved by their applications.
The paper is organized as follows. Section 2 presents the methodology and introduces metrics and criteria for the evaluation of the overall behavior of a parallel application. Section 3 focuses on the behavior of the processors allocated to the application. Section 4 presents an application of the methodology on a few case studies. Finally, Section 5 summarizes the methodology and discusses its integration into a performance analysis tool.
Section snippets
Characterization of performance properties
Tuning and debugging the performance of a parallel application can be seen as an iterative process consisting of several steps, dealing with the identification and localization of inefficiencies, their repair and the verification and validation of the achieved performance. Our objective is to define performance metrics and criteria for explaining the properties and the behavior of an application by identifying and localizing its performance inefficiencies.
As already stated, these metrics rely
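The excerpt breaks off before the metric definitions, but the kind of run-time measurement such metrics build on can be illustrated with a minimal sketch. The trace records, activity names, and aggregation below are illustrative assumptions, not the paper's actual instrumentation:

```python
from collections import defaultdict

# Hypothetical trace records collected by monitoring: each tuple is
# (processor rank, activity, elapsed seconds). The activities mirror
# those named in the paper (computation, communication, I/O).
trace = [
    (0, "computation", 4.0), (0, "communication", 1.0),
    (1, "computation", 3.0), (1, "communication", 2.0),
    (2, "computation", 4.5), (2, "communication", 0.5),
]

# Coarse-grain characterization: total time per activity across all
# processors, expressed as a fraction of the overall measured time.
totals = defaultdict(float)
for _, activity, dt in trace:
    totals[activity] += dt

wall = sum(totals.values())
breakdown = {activity: t / wall for activity, t in totals.items()}
```

A breakdown of this kind gives the overall behavior of the application in terms of its activities, the starting point before drilling down to individual processors and code regions.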
Characterization of processor dissimilarities
The coarse grain characterization of the performance properties of parallel applications is followed by a fine grain characterization that focuses on the behavior of the processors with the objective of identifying the most imbalanced activity and code region.
Load balancing is an ideal condition for an application to achieve good performance by fully exploiting the benefits of parallel computing. Programming inefficiencies might lead to uneven work distribution among processors that, in turn,
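Although the snippet above is truncated, the load-imbalance idea it introduces can be sketched with a common indicator: the relative gap between the slowest processor and the mean. This indicator and the per-processor timings below are assumptions for illustration, not necessarily the metric defined in the paper:

```python
# times[activity] lists the time each processor spent in that activity
# (illustrative data; processor 1 communicates far more than the others).
times = {
    "computation":   [4.0, 3.0, 4.5],
    "communication": [1.0, 2.0, 0.5],
}

def imbalance(samples):
    """Relative excess of the slowest processor over the mean.

    Returns 0.0 for a perfectly balanced activity; larger values mean
    the slowest processor increasingly dominates the others.
    """
    mean = sum(samples) / len(samples)
    return (max(samples) - mean) / mean

scores = {activity: imbalance(v) for activity, v in times.items()}
worst = max(scores, key=scores.get)  # the most imbalanced activity
```

Ranking activities (and, at a finer grain, code regions) by such a score is one way to localize where uneven work distribution hurts performance most.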
Case studies
In this section, we discuss our methodology on three case studies dealing with the identification of the inefficiencies of two kernels from the NAS Parallel Benchmarks 2.3 suite [13] and of a computational fluid dynamics application [10].
These case studies illustrate the application of our methodology on programs with different characteristics and on measurements collected at different levels of granularity. Note that to derive preliminary insights into the behavior of the processors the case
Conclusions
Performance analysis of parallel applications is quite challenging. Many factors influence the performance and it is difficult to assess whether and where the applications have experienced poor performance.
The methodological approach presented in this paper is in the framework of automatic performance analysis of parallel applications and is aimed at the identification and localization of their performance inefficiencies. The methodology provides users with some guidelines for the
References (18)
- et al., High-performance, portable implementation of the MPI Message Passing Interface standard, Parallel Computing (1996)
- et al., Analyzing parallel program performance using normalized performance indices and trace transformation techniques, Parallel Computing (1996)
- et al., Medea: a tool for workload characterization of parallel systems, IEEE Parallel and Distributed Technology (1995)
- L. DeRose, Y. Zhang, D.A. Reed, SvPablo: a multi-language performance analysis system, in: R. Puigjaner, N. Savino, B....
- A. Espinosa, T. Margalef, E. Luque, Automatic performance evaluation of parallel programs, in: Proceedings of 6th...
- T. Fahringer, M. Geissler, G. Madsen, H. Moritsch, C. Seragiotto, On using Aksum for semi-automatically searching of...
- K. Ferschweiler, M. Calzarossa, C. Pancake, D. Tessera, D. Keon, A community databank for performance tracefiles, in:...
- Clustering Algorithms (1975)
- B. Helm, A. Malony, S. Fickas, Capturing and automating performance diagnosis: the Poirot approach, in: Proceedings of...
Cited by (9)
Automatic performance debugging of SPMD-style parallel programs
2011, Journal of Parallel and Distributed Computing

Citation excerpt: "By clustering thread performance for different metrics, PerfExplorer should discover these relationships and which metrics best distinguish their differences. Calzarossa et al. [6] propose a top-down methodology towards automatic performance analysis of parallel applications: first, they focus on the overall behavior of the application in terms of its activities, and then they consider individual code regions and activities performed within each code region. Calzarossa et al. [6] utilize clustering techniques to summarize and interpret the performance information by identifying patterns or groups of code regions characterized by a similar behavior."
Identifying the root causes of wait states in large-scale parallel applications
2016, ACM Transactions on Parallel Computing

Performance analysis of MPI parallel programs on Xen virtual machines
2014, Proceedings - 2013 IEEE International Conference on High Performance Computing and Communications, HPCC 2013 and 2013 IEEE International Conference on Embedded and Ubiquitous Computing, EUC 2013

Automatic performance diagnosis of parallel applications on heterogeneous systems
2012, International Journal of Digital Content Technology and its Applications

Characterizing load and communication imbalance in large-scale parallel applications
2012, Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2012

Identifying the root causes of wait states in large-scale parallel applications
2010, Proceedings of the International Conference on Parallel Processing
☆ This work has been supported by the Italian Ministry of Education, Universities and Research (MIUR) under the FIRB and Cofin Programmes and by the University of Pavia under the FAR Programme.