Online root-cause performance analysis of parallel applications
Introduction
Although hardware is evolving at an incredible rate, advances in parallel software have been hampered for many reasons. The main reason is that developing efficient parallel applications with existing programming models is a complex task. Moreover, parallel applications rarely achieve good performance immediately; hence, careful performance analysis and optimization are crucial. These tasks are difficult and costly: in practice, developers must understand both the application and its environment, and often focus more on resource usage, communication, or synchronization than on the actual problem their programs solve.
There are tools that automate the identification of performance bottlenecks and their locations in the source code [1], [2], [3], [4], [5], [6]. They help a developer understand what happens, where, and when [7], [8], [9], but they do not automate the inference process that finds the causes of the performance problems. Detecting a bottleneck somewhere does not indicate why it happens and is often misleading. Only when the root causes of a performance problem are correctly identified is it possible to provide effective solutions. An overhead originating at a certain point in a task can causally propagate through the task flow, and then through the message flow to another task, causing further inefficiencies at other points. It is therefore necessary to provide tools that better assist developers in understanding the behavior of their programs by automating the search not only for performance problems, but also for their root causes. Such tools could be valuable for both novice and expert users, and could ease and shorten the performance optimization process. Despite this, the automation of root-cause analysis is still an open field of research, even though many performance problems can be quickly located and explained with automated techniques that work on unmodified parallel applications during their execution [10], [11].
To address these challenges, we have developed a new two-step approach for dynamic and automated application performance modelling and analysis, in which an MPI application is automatically modelled and diagnosed during its execution. First, an online performance modelling technique enables the automated discovery of causal execution paths through the communication and computational activities of message-passing parallel programs. Second, an automated analysis uses the online model to quickly identify the most important performance problems and correlate them with the application source code (tasks, modules or functions). Because the analysis runs at run-time on a continuously updated model, performance problems can be identified significantly faster than with a post-mortem approach. The analysis techniques investigate not only the performance problems but also their causal relationships, and infer root causes in certain scenarios. Developers and non-expert users are thereby relieved of some of the performance-related duties.
Using this approach, it is possible to discover causal dependencies among the problems, infer their root causes during the application execution and explain them to developers. The online application model is based on the previous work presented in [12], while the methodology for the root-cause analysis is the main contribution of this paper.
The remainder of this paper is organized as follows. Section 2 introduces our approach for online performance analysis that can be deployed on arbitrary MPI applications running in large-scale parallel systems. Section 3 briefly describes an online performance modelling technique that we have proposed for understanding the behavior of parallel applications. Section 4 presents the automated root-cause analysis of performance problems using the online modelling technique. Section 5 presents examples of parallel applications that we were able to analyze online with our automated techniques. Section 6 surveys related work, and, finally, Section 7 concludes our work and suggests directions for future research.
Section snippets
Online performance analysis of parallel applications
To provide online performance analysis of parallel applications, we developed a set of techniques for monitoring, modelling and diagnosing these applications during execution. The two main phases of this approach are online performance modelling and performance analysis.
To understand application behavior, we developed an online performance modelling technique that is based on the previously presented work [12]. By following the execution flow and intercepting communication at run-time, this
Performance model of parallel applications
Our approach to the online construction of the application model combines features of both static and dynamic analysis methods. We perform an offline analysis of the binary executable, discover the static code structure, and dynamically instrument selected loops to detect cycle boundaries. At run-time, we perform selective event tracing and the aggregation of executed activities. This technique maintains a tradeoff between the large volume of collected data and the preserved level of detail [12].
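As a minimal illustration of this tradeoff (the event records, activity names, and aggregation policy here are hypothetical, not the tool's actual format), selective aggregation keeps per-activity accumulated times and counts instead of every raw event:

```python
from collections import defaultdict

# Hypothetical trace records: (task_id, activity_name, duration_in_seconds).
def aggregate_events(events):
    """Collapse raw trace events into per-(task, activity) summaries."""
    summary = defaultdict(lambda: {"total": 0.0, "count": 0})
    for task, activity, duration in events:
        entry = summary[(task, activity)]
        entry["total"] += duration
        entry["count"] += 1
    return dict(summary)

events = [
    (0, "MPI_Recv", 0.5),
    (0, "compute_loop", 1.0),
    (0, "MPI_Recv", 0.25),
    (1, "compute_loop", 1.5),
]
print(aggregate_events(events)[(0, "MPI_Recv")])  # {'total': 0.75, 'count': 2}
```

The summaries grow with the number of distinct activities rather than with the number of events, which is what makes long-running online monitoring feasible.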
Root-cause performance analysis
We have defined and developed a root-cause performance analysis (RCA) approach. RCA is an iterative process that is divided into the following phases performed during the application execution:
- Phase 1: Identification of problems. The goal of this phase is to detect the most severe performance bottlenecks and their locations in the application. We identify the problems for each individual task and for the entire application. A performance bottleneck is defined as an activity whose accumulated
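The bottleneck test in Phase 1 can be sketched as follows (the profile, activity names, and the 20% threshold are illustrative assumptions, not the tool's actual criteria): an activity is flagged when its accumulated time is a large fraction of the task's total time.

```python
# Sketch: flag activities whose accumulated time exceeds a chosen
# fraction of the task's total execution time, most severe first.
def find_bottlenecks(task_profile, threshold=0.2):
    total = sum(task_profile.values())
    return sorted(
        (a for a, t in task_profile.items() if t / total >= threshold),
        key=lambda a: task_profile[a],
        reverse=True,
    )

# Hypothetical per-task accumulated times in seconds.
profile = {"MPI_Recv": 3.0, "compute": 5.0, "MPI_Send": 0.5, "io": 1.5}
print(find_bottlenecks(profile))  # ['compute', 'MPI_Recv']
```

In the real analysis this test would run repeatedly against the continuously updated online model rather than against a static profile.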
Experimental work
To validate that our analysis approach can detect and correctly diagnose performance problems, we applied it to find the causes of several problems in parallel applications, in particular:
- SPMD—WaveSend. This program implements the concurrent wave equation as described in [24]. A vibrating string is decomposed into a vector of points. Since the amplitude of each point depends on its neighbors, a contiguous block of points is assigned to each task. Each task is responsible for
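The point update underlying WaveSend can be sketched serially as below (the coupling constant and initial values are illustrative assumptions; in the MPI program each task updates its own contiguous block and exchanges boundary points with neighbouring tasks):

```python
# One time step of the discretized 1-D wave equation: each interior point
# depends on its own history and on both neighbours, which is why a task
# owning a contiguous block must exchange boundary points with neighbours.
def wave_step(u_prev, u_curr, c=0.1):  # c: hypothetical coupling constant
    u_next = list(u_curr)
    for i in range(1, len(u_curr) - 1):  # string endpoints stay fixed
        u_next[i] = (2.0 * u_curr[i] - u_prev[i]
                     + c * (u_curr[i - 1] - 2.0 * u_curr[i] + u_curr[i + 1]))
    return u_next

u0 = [0.0, 1.0, 1.0, 1.0, 0.0]
print(wave_step(u0, u0))
```

The neighbour dependence in the update is what makes the task-to-task message flow a candidate path for causally propagated overheads.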
Related work
Inter-task synchronization and its performance impact are a well-known problem. Some approaches call this wait-time analysis (Carnival [16]) or inefficiency analysis (KappaPI [29]), while others formalize problem descriptions by means of performance patterns or properties (APART ASL [31], KappaPI-2 [3], EXPERT [5], Scalasca [6], Periscope [32]). All of these approaches use different formalisms to express knowledge about common performance overheads of message-passing parallel programs. In our work,
Conclusions
We have developed and evaluated a systematic approach for online application performance modelling and analysis. In this approach, the application is monitored, modelled and diagnosed during its execution. The automated analysis determines the most important performance problems, correlates them with the application source code, attempts to infer their root causes, and explains them to developers.
The online performance modelling enables autonomous and low-overhead execution monitoring that generates
Acknowledgment
This research has been supported by the MICINN-Spain under contract TIN2011-28689.
References (35)
- et al., Automatic performance analysis of hybrid MPI/OpenMP applications, J. Syst. Archit. (2003)
- et al., PARAVER: A Tool to Visualize and Analyze Parallel Code, Technical Report (1995)
- et al., Dimemas: Predicting MPI applications behaviour in grid environments, Workshop on Grid Applications and Programming Tools (GGF8) (2003)
- et al., Performance analysis of parallel applications with KappaPI 2 (2005)
- et al., Scalable parallel trace-based performance analysis (2006)
- et al., Extending Scalasca's analysis features, Tools for High Performance Computing 2012 (2013)
- et al., The Paradyn parallel performance measurement tool, Computer (1995)
- et al., VAMPIR: Visualization and analysis of MPI resources, Supercomputer (1996)
- et al., SCALEA: A performance analysis tool for parallel programs, Concurr. Comput.: Pract. Exper. (2003)
- et al., MATE: Dynamic performance tuning environment (2004)
- Improving performance on data-intensive applications using a load balancing methodology based on divisible load theory, Int. J. Parallel Program.
- On-line performance modeling for MPI applications
- Incremental call-path profiling, Concurr. Comput.: Pract. Exper.
- Specification and detection of performance problems with ASL, Concurr. Comput.: Pract. Exper.
- Waiting time analysis and performance visualization in Carnival, Proceedings of the SIGMETRICS Symposium on Parallel and Distributed Tools (SPDT’96)