Online root-cause performance analysis of parallel applications
Introduction
Although hardware is evolving at an incredible rate, advances in parallel software have been hampered for many reasons. The main reason is that developing efficient parallel applications with existing programming models is a complex task. Moreover, parallel applications rarely achieve good performance immediately; hence, careful performance analysis and optimization are crucial. These tasks are difficult and costly: in practice, developers must understand both the application and its environment, and often focus more on resource usage, communication, or synchronization than on the actual problem their programs solve.
There are tools that automate the identification of performance bottlenecks and their locations in the source code [1], [2], [3], [4], [5], [6]. They help a developer understand what happens, where, and when [7], [8], [9], but they do not automate the inference process that finds the causes of the performance problems. Detecting a bottleneck somewhere does not indicate why it happens and is often misleading. Only when the root causes of a performance problem are correctly identified is it possible to provide effective solutions. An overhead originating at a certain point in a task can causally propagate through the task flow, and then through the message flow to another task, causing further inefficiencies at other points. It is therefore necessary to provide tools that better assist developers in understanding the behavior of their programs by automating the search not only for performance problems, but also for their root causes. Such tools could be valuable for both novice and expert users, and could ease and shorten the performance optimization process. Despite this, the automation of root-cause analysis is still an open field of research, even though many performance problems can be quickly located and explained with automated techniques that work on unmodified parallel applications during their execution [10], [11].
To address these challenges, we have developed a new two-step approach for dynamic and automated application performance modelling and analysis, in which an MPI application is automatically modelled and diagnosed during its execution. First, an online performance modelling technique enables the automated discovery of causal execution paths through the communication and computational activities of message-passing parallel programs. Second, an automated analysis uses the online model to quickly identify the most important performance problems and correlate them with the application source code (tasks, modules or functions). Because the analysis runs at run-time on a continuously updated model, performance problems can be identified significantly faster than with a post-mortem approach. The analysis techniques investigate not only the performance problems but also their causal relationships, and infer root causes in certain scenarios. Developers and non-expert users are thereby relieved of some of the performance-related duties.
Using this approach, it is possible to discover causal dependencies among the problems, infer their root causes during the application execution and explain them to developers. The online application model is based on the previous work presented in [12], while the methodology for the root-cause analysis is the main contribution of this paper.
The remainder of this paper is organized as follows. Section 2 introduces our approach for online performance analysis that can be deployed on arbitrary MPI applications running in large-scale parallel systems. Section 3 briefly describes an online performance modelling technique that we have proposed for understanding the behavior of parallel applications. Section 4 presents the automated root-cause analysis of performance problems using the online modelling technique. Section 5 presents examples of parallel applications that we were able to analyze online with our automated techniques. Section 6 surveys related work, and, finally, Section 7 concludes our work and suggests directions for future research.
Section snippets
Online performance analysis of parallel applications
To provide online performance analysis of parallel applications, we developed a set of techniques for monitoring, modelling and diagnosing these applications during execution. The two main phases of this approach are online performance modelling and performance analysis.
To understand application behavior, we developed an online performance modelling technique that is based on the previously presented work [12]. By following the execution flow and intercepting communication at run-time, this
Performance model of parallel applications
Our approach to the online construction of the application model combines features of both static and dynamic analysis methods. We perform an offline analysis of the binary executable, discover the static code structure, and dynamically instrument selected loops to detect cycle boundaries. At run-time, we perform selective event tracing and the aggregation of executed activities. This technique maintains a tradeoff between the large volume of collected data and the preserved level of detail [12].
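As a minimal illustration of this tradeoff (the event records, activity names, and aggregation policy here are hypothetical, not the tool's actual format), selective aggregation keeps per-activity accumulated times and counts instead of every raw event:

```python
from collections import defaultdict

# Hypothetical trace records: (task_id, activity_name, duration_in_seconds).
def aggregate_events(events):
    """Collapse raw trace events into per-(task, activity) summaries."""
    summary = defaultdict(lambda: {"total": 0.0, "count": 0})
    for task, activity, duration in events:
        entry = summary[(task, activity)]
        entry["total"] += duration
        entry["count"] += 1
    return dict(summary)

events = [
    (0, "MPI_Recv", 0.5),
    (0, "compute_loop", 1.0),
    (0, "MPI_Recv", 0.25),
    (1, "compute_loop", 1.5),
]
print(aggregate_events(events)[(0, "MPI_Recv")])  # {'total': 0.75, 'count': 2}
```

The summaries grow with the number of distinct activities rather than with the number of events, which is what makes long-running online monitoring feasible.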
Root-cause performance analysis
We have defined and developed a root-cause performance analysis (RCA) approach. RCA is an iterative process that is divided into the following phases performed during the application execution:
- Phase 1: Identification of problems. The goal of this phase is to detect the most severe performance bottlenecks and their locations in the application. We identify the problems for each individual task and for the entire application. A performance bottleneck is defined as an activity whose accumulated
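The bottleneck test in Phase 1 can be sketched as follows (the profile, activity names, and the 20% threshold are illustrative assumptions, not the tool's actual criteria): an activity is flagged when its accumulated time is a large fraction of the task's total time.

```python
# Sketch: flag activities whose accumulated time exceeds a chosen
# fraction of the task's total execution time, most severe first.
def find_bottlenecks(task_profile, threshold=0.2):
    total = sum(task_profile.values())
    return sorted(
        (a for a, t in task_profile.items() if t / total >= threshold),
        key=lambda a: task_profile[a],
        reverse=True,
    )

# Hypothetical per-task accumulated times in seconds.
profile = {"MPI_Recv": 3.0, "compute": 5.0, "MPI_Send": 0.5, "io": 1.5}
print(find_bottlenecks(profile))  # ['compute', 'MPI_Recv']
```

In the real analysis this test would run repeatedly against the continuously updated online model rather than against a static profile.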
Experimental work
To validate that our analysis approach can detect and correctly diagnose performance problems, we applied it to find the causes of several problems in parallel applications, in particular:
- SPMD—WaveSend. This program implements the concurrent wave equation as described in [24]. A vibrating string is decomposed into a vector of points. Since the amplitude of each point depends on its neighbors, a contiguous block of points is assigned to each task. Each task is responsible for
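The point update underlying WaveSend can be sketched serially as below (the coupling constant and initial values are illustrative assumptions; in the MPI program each task updates its own contiguous block and exchanges boundary points with neighbouring tasks):

```python
# One time step of the discretized 1-D wave equation: each interior point
# depends on its own history and on both neighbours, which is why a task
# owning a contiguous block must exchange boundary points with neighbours.
def wave_step(u_prev, u_curr, c=0.1):  # c: hypothetical coupling constant
    u_next = list(u_curr)
    for i in range(1, len(u_curr) - 1):  # string endpoints stay fixed
        u_next[i] = (2.0 * u_curr[i] - u_prev[i]
                     + c * (u_curr[i - 1] - 2.0 * u_curr[i] + u_curr[i + 1]))
    return u_next

u0 = [0.0, 1.0, 1.0, 1.0, 0.0]
print(wave_step(u0, u0))
```

The neighbour dependence in the update is what makes the task-to-task message flow a candidate path for causally propagated overheads.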
Related work
Inter-task synchronization and its performance impact are a well-known problem. Some approaches call this wait-time analysis (Carnival [16]) or inefficiency analysis (KappaPI [29]), while others formalize problem descriptions by means of performance patterns or properties (APART ASL [31], KappaPI-2 [3], EXPERT [5], Scalasca [6], Periscope [32]). All of these approaches use different formalisms to express knowledge about common performance overheads of message-passing parallel programs. In our work,
Conclusions
We have developed and evaluated a systematic approach for online application performance modelling and analysis. In this approach, the application is monitored, modelled and diagnosed during its execution. The automated analysis determines the most important performance problems, correlates them with the application source code, attempts to infer their root causes, and explains them to developers.
The online performance modelling enables autonomous and low-overhead execution monitoring that generates
Acknowledgment
This research has been supported by the MICINN-Spain under contract TIN2011-28689.
References (35)
- et al., Automatic performance analysis of hybrid MPI/OpenMP applications, J. Syst. Archit. (2003)
- et al., PARAVER: A Tool to Visualize and Analyze Parallel Code, Technical Report (1995)
- et al., Dimemas: Predicting MPI applications behaviour in grid environments, Workshop on Grid Applications and Programming Tools (GGF8) (2003)
- et al., Performance analysis of parallel applications with KappaPI 2 (2005)
- et al., Scalable parallel trace-based performance analysis (2006)
- et al., Extending Scalasca's analysis features, Tools for High Performance Computing 2012 (2013)
- et al., The Paradyn parallel performance measurement tool, Computer (1995)
- et al., VAMPIR: Visualization and analysis of MPI resources, Supercomputer (1996)
- et al., SCALEA: A performance analysis tool for parallel programs, Concurr. Comput.: Pract. Exper. (2003)
- et al., MATE: Dynamic performance tuning environment (2004)
- Improving performance on data-intensive applications using a load balancing methodology based on divisible load theory, Int. J. Parallel Program.
- On-line performance modeling for MPI applications
- Incremental call-path profiling, Concurr. Comput.: Pract. Exper.
- Specification and detection of performance problems with ASL, Concurr. Comput.: Pract. Exper.
- Waiting time analysis and performance visualization in Carnival, Proceedings of the SIGMETRICS Symposium on Parallel and Distributed Tools (SPDT’96)