A methodology towards automatic performance analysis of parallel applications☆
Introduction
The performance achieved by a parallel application is the result of complex interactions between the hardware and software resources of the system where the application is being executed. The characteristics of the application, e.g., algorithmic structure, input parameters, problem size, influence these interactions by determining how the application exploits the available resources and the allocated processors. In this framework, tuning and debugging the performance of parallel applications become challenging issues [16].
A typical approach to address these issues is experimental, that is, based on instrumenting the application, monitoring its execution and analyzing its performance either on the fly or post-mortem. Many tools have been developed for this purpose (see e.g., [1], [2], [17], [18]). These tools analyze the measurements collected at run-time and provide statistics and diagrams describing the performance of the application and of its activities, e.g., computation, communication, I/O. The major drawback of these tools is that they fail to assist users in mastering the complexity inherent in this analysis.
To overcome this drawback, various methodological approaches have been proposed and tools have been developed out of these approaches with the aim of identifying performance bottlenecks, that is, the code regions, e.g., routines, loops, of the applications that are critical from the performance viewpoint. The Poirot project [8] proposed a tool architecture to automatically diagnose parallel applications using a heuristic classification scheme. The Paradyn Parallel Performance tool [12] dynamically instruments the applications to automate bottleneck detection at run-time. The Paradyn Performance Consultant performs a hierarchical search for bottlenecks and refines this search by using stack sampling [14] and by pruning the search space considering the behavior of the application during previous runs [9]. The Kappa-Pi tool [3] deals with post-mortem automatic performance analysis of message passing applications based on PVM. The analysis of processor utilizations leads to the identification of performance bottlenecks, which are classified by means of a rule-based knowledge system. Aksum [4] automatically performs multiple runs of a parallel application and detects performance bottlenecks by comparing the performance achieved while varying the problem size and the number of allocated processors.
In this paper, we address the analysis of the performance of parallel applications from a methodological viewpoint with the aim of identifying and localizing performance inefficiencies. We define new performance metrics and criteria that highlight the properties of the applications as well as the load imbalance and dissimilarities in the behavior of the allocated processors. These metrics rely on the measurements collected by monitoring the applications at run-time. The integration of this methodology into a performance tool will help users interpret the performance achieved by their applications.
The paper is organized as follows. Section 2 presents the methodology and introduces metrics and criteria for the evaluation of the overall behavior of a parallel application. Section 3 focuses on the behavior of the processors allocated to the application. Section 4 presents an application of the methodology on a few case studies. Finally, Section 5 summarizes the methodology and discusses its integration into a performance analysis tool.
Section snippets
Characterization of performance properties
Tuning and debugging the performance of a parallel application can be seen as an iterative process consisting of several steps, dealing with the identification and localization of inefficiencies, their repair and the verification and validation of the achieved performance. Our objective is to define performance metrics and criteria for explaining the properties and the behavior of an application by identifying and localizing its performance inefficiencies.
As already stated, these metrics rely
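The excerpt breaks off before the metric definitions, but the kind of run-time measurement such metrics build on can be illustrated with a minimal sketch. The trace records, activity names, and aggregation below are illustrative assumptions, not the paper's actual instrumentation:

```python
from collections import defaultdict

# Hypothetical trace records collected by monitoring: each tuple is
# (processor rank, activity, elapsed seconds). The activities mirror
# those named in the paper (computation, communication, I/O).
trace = [
    (0, "computation", 4.0), (0, "communication", 1.0),
    (1, "computation", 3.0), (1, "communication", 2.0),
    (2, "computation", 4.5), (2, "communication", 0.5),
]

# Coarse-grain characterization: total time per activity across all
# processors, expressed as a fraction of the overall measured time.
totals = defaultdict(float)
for _, activity, dt in trace:
    totals[activity] += dt

wall = sum(totals.values())
breakdown = {activity: t / wall for activity, t in totals.items()}
```

A breakdown of this kind gives the overall behavior of the application in terms of its activities, the starting point before drilling down to individual processors and code regions.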
Characterization of processor dissimilarities
The coarse grain characterization of the performance properties of parallel applications is followed by a fine grain characterization that focuses on the behavior of the processors with the objective of identifying the most imbalanced activity and code region.
Load balancing is an ideal condition for an application to achieve good performance by fully exploiting the benefits of parallel computing. Programming inefficiencies might lead to uneven work distribution among processors that, in turn,
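Although the snippet above is truncated, the load-imbalance idea it introduces can be sketched with a common indicator: the relative gap between the slowest processor and the mean. This indicator and the per-processor timings below are assumptions for illustration, not necessarily the metric defined in the paper:

```python
# times[activity] lists the time each processor spent in that activity
# (illustrative data; processor 1 communicates far more than the others).
times = {
    "computation":   [4.0, 3.0, 4.5],
    "communication": [1.0, 2.0, 0.5],
}

def imbalance(samples):
    """Relative excess of the slowest processor over the mean.

    Returns 0.0 for a perfectly balanced activity; larger values mean
    the slowest processor increasingly dominates the others.
    """
    mean = sum(samples) / len(samples)
    return (max(samples) - mean) / mean

scores = {activity: imbalance(v) for activity, v in times.items()}
worst = max(scores, key=scores.get)  # the most imbalanced activity
```

Ranking activities (and, at a finer grain, code regions) by such a score is one way to localize where uneven work distribution hurts performance most.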
Case studies
In this section, we discuss our methodology on three case studies dealing with the identification of the inefficiencies of two kernels from the NAS Parallel Benchmarks 2.3 suite [13] and of a computational fluid dynamics application [10].
These case studies illustrate the application of our methodology on programs with different characteristics and on measurements collected at different levels of granularity. Note that to derive preliminary insights into the behavior of the processors the case
Conclusions
Performance analysis of parallel applications is quite challenging. Many factors influence the performance and it is difficult to assess whether and where the applications have experienced poor performance.
The methodological approach presented in this paper is in the framework of automatic performance analysis of parallel applications and is aimed at the identification and localization of their performance inefficiencies. The methodology provides users with some guidelines for the
References (18)
- et al., High-performance, portable implementation of the MPI Message Passing Interface standard, Parallel Computing (1996)
- et al., Analyzing parallel program performance using normalized performance indices and trace transformation techniques, Parallel Computing (1996)
- et al., Medea: a tool for workload characterization of parallel systems, IEEE Parallel and Distributed Technology (1995)
- L. DeRose, Y. Zhang, D.A. Reed, SvPablo: a multi-language performance analysis system, in: R. Puigjaner, N. Savino, B....
- A. Espinosa, T. Margalef, E. Luque, Automatic performance evaluation of parallel programs, in: Proceedings of 6th...
- T. Fahringer, M. Geissler, G. Madsen, H. Moritsch, C. Seragiotto, On using Aksum for semi-automatically searching of...
- K. Ferschweiler, M. Calzarossa, C. Pancake, D. Tessera, D. Keon, A community databank for performance tracefiles, in:...
- Clustering Algorithms (1975)
- B. Helm, A. Malony, S. Fickas, Capturing and automating performance diagnosis: the Poirot approach, in: Proceedings of...
Cited by (9)
Automatic performance debugging of SPMD-style parallel programs
2011, Journal of Parallel and Distributed Computing

Citation excerpt: "By clustering thread performance for different metrics, PerfExplorer should discover these relationships and which metrics best distinguish their differences. Calzarossa et al. [6] propose a top-down methodology towards automatic performance analysis of parallel applications: first, they focus on the overall behavior of the application in terms of its activities, and then they consider individual code regions and activities performed within each code region. Calzarossa et al. [6] utilize clustering techniques to summarize and interpret the performance information by identifying patterns or groups of code regions characterized by a similar behavior."
Identifying the root causes of wait states in large-scale parallel applications
2016, ACM Transactions on Parallel Computing

Performance analysis of MPI parallel programs on Xen virtual machines
2014, Proceedings - 2013 IEEE International Conference on High Performance Computing and Communications, HPCC 2013 and 2013 IEEE International Conference on Embedded and Ubiquitous Computing, EUC 2013

Automatic performance diagnosis of parallel applications on heterogeneous systems
2012, International Journal of Digital Content Technology and its Applications

Characterizing load and communication imbalance in large-scale parallel applications
2012, Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2012

Identifying the root causes of wait states in large-scale parallel applications
2010, Proceedings of the International Conference on Parallel Processing
☆ This work has been supported by the Italian Ministry of Education, Universities and Research (MIUR) under the FIRB and Cofin Programmes and by the University of Pavia under the FAR Programme.