Locating and categorizing inefficient communication patterns in HPC systems using inter-process communication traces

https://doi.org/10.1016/j.jss.2022.111494

Highlights

  • This paper describes a novel approach for detecting inefficient communication patterns in HPC system execution traces using information theory and statistical analysis methods.

  • The approach categorizes inefficient patterns based on their severity and complexity levels to better guide analysts.

  • The approach is useful for software developers and engineers when debugging and analyzing performance issues in HPC systems.

  • The effectiveness of the approach is shown by applying it to five large traces from three open HPC programs that use MPI for inter-process communication.

  • The threats to validity and the limitations of the approach are carefully discussed along with important future directions in the field.

Abstract

High Performance Computing (HPC) systems are used in a variety of industrial and research sectors to solve complex problems that require powerful computing platforms. For these systems to remain reliable, analysts must be able to debug and analyze their behavior in order to detect the root causes of poor performance. Execution traces hold important information regarding the events and interactions among communicating processes, which is essential for the debugging of inter-process communication. Traces, however, tend to be very large, which hinders their analysis. In previous work, we presented an approach for automatically detecting communication patterns and segmenting large HPC traces into execution phases. The goal is to reduce the effort of analyzing traces by allowing software analysts to focus on smaller parts of interest. In this paper, we propose an approach for detecting and localizing inefficient communication patterns using statistical and trace segmentation methods. In addition, we use the Analytic Hierarchy Process to categorize slow communication patterns based on their severity and complexity levels. Using our approach, an analyst can quickly locate slow communication patterns that may be the cause of important performance problems. We show the effectiveness of our approach by applying it to large traces from three HPC systems.

Introduction

The demand for High Performance Computing (HPC) systems continues to grow as many industrial and research sectors, such as bioinformatics, medical information processing, and financial analytics, require powerful platforms to process and solve large and complex problems (Heldens et al., 2020). The popularity of HPC programs has further flourished with the advent of multicore and cloud computing environments.

HPC programs developed using the Message Passing Interface (MPI) standard (MPI Forum, 2012) rely on a large number of processes working together, by exchanging messages, to solve computationally intensive problems. MPI organizes processes into groups called communicators. Processes in a communicator interact with each other according to a virtual topology, which usually follows a linear, 2-, or 3-dimensional mesh structure. Processes communicate with their nearest or non-nearest neighbors in the mesh. In a typical MPI program, these communications are repetitive and form communication patterns. A communication pattern groups sequences of MPI communication events from different processes that are working towards a specific task. The binary tree and butterfly patterns are examples of communication patterns (Navaridas et al., 2008). Fig. 1 shows the butterfly and the binary tree patterns for eight processes.
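For concreteness, the butterfly exchange in Fig. 1 follows a simple rule: at stage k, each process pairs with the rank whose k-th bit differs from its own. The sketch below (illustrative only, not part of the authors' tooling) enumerates each rank's exchange partners:

```python
def butterfly_partners(rank, n_procs):
    """Exchange partners of `rank` at each stage of a butterfly
    pattern over n_procs processes (n_procs must be a power of two).
    At stage k, a process pairs with the rank whose k-th bit differs."""
    assert n_procs & (n_procs - 1) == 0, "power of two required"
    stages = n_procs.bit_length() - 1
    return [rank ^ (1 << k) for k in range(stages)]

# For 8 processes, rank 0 exchanges with ranks 1, 2, then 4.
print(butterfly_partners(0, 8))  # [1, 2, 4]
```

The binary tree pattern in Fig. 1 can be expressed with a similar bit rule, halving the set of active processes at each stage instead of pairing all of them.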

Performance analysis and debugging of HPC systems require dynamic analysis techniques due to the distributed nature of these systems. Early work by Preissl et al. (2008) showed that automatic identification of communication patterns from execution traces helps in understanding an application’s communication behavior, which in turn facilitates debugging and performance analysis tasks. The problem is that typical traces can be overwhelmingly large, with many instances of various communication patterns. The mere detection of communication patterns may still generate more data than a software analyst can grasp. To alleviate this problem, researchers have proposed trace segmentation, in which a large trace is partitioned into distinct segments that depict execution phases of the traced scenario, and communication patterns are detected within each segment (Casas et al., 2010, Isaacs et al., 2015, Alawneh et al., 2016).

An execution phase is broadly defined as a region of the trace that contains communication patterns implementing a specific program functionality (Alawneh et al., 2016). Trace segmentation is also used to support program comprehension tasks in monolithic systems (Pirzadeh et al., 2013). The main objective is to let the software analyst focus only on the parts of the trace that are of interest instead of browsing the whole trace. Casas et al. (2010) and Chetsa et al. (2013) showed how MPI trace execution phases can help with performance optimization tasks by uncovering the regions of a trace with the highest latency. Isaacs et al. (2015) presented a trace visualization and analysis tool that logically orders and visualizes MPI communication behavior in fine-grained phases to determine lateness in program operations using temporal metrics and visual inspection.

In our previous work (Alawneh et al., 2016), we proposed an effective trace segmentation approach, which involves two main steps. In the first step, we detect communication patterns in the entire trace using natural language processing techniques. In the second step, we use the extracted communication patterns to identify dense homogeneous clusters, which represent distinct execution phases of the trace. This is achieved using information theory concepts such as Shannon entropy (Shannon, 1948) and the Jensen–Shannon Divergence measure (Grosse et al., 2002). The new contributions of this paper are summarized as follows:

  • We improve our previous technique for segmenting traces into execution phases by using the Akaike Information Criterion (AIC) (Akaike, 1981) to identify finer-grained execution phases.

  • We extend the communication pattern detection approach by using distinctive events to identify the boundaries of coherent communication events in each process trace, which facilitates the detection of process repeating patterns.

  • We propose an approach for detecting inefficient communication pattern instances in trace segments using statistical analysis. More specifically, we use the Median Absolute Deviation (MAD) and the Modified Z-score measures (Iglewicz and Hoaglin, 1993) to determine slow communication patterns.

  • We propose an approach for the categorization of communication patterns using the Analytic Hierarchy Process (AHP) (Saaty, 1990) by examining the complexity and severity levels for slow patterns in execution phases.

  • We demonstrate the effectiveness of our approach by applying it to five large traces generated from three different HPC systems.

  • Through the analysis of a sample of inefficient patterns detected by our approach, we provide a detailed discussion of their potential root causes, which demonstrates the usefulness of our approach in practice.
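The outlier test behind the third contribution can be illustrated with a short sketch. The `slow_instances` helper below is hypothetical (not the authors' implementation), but it follows the Iglewicz and Hoaglin (1993) formulation, which computes the Modified Z-score M_i = 0.6745 (x_i − median) / MAD and flags values with |M_i| > 3.5 as outliers:

```python
import statistics

def slow_instances(durations, threshold=3.5):
    """Flag pattern instances whose Modified Z-score exceeds the 3.5
    cutoff suggested by Iglewicz and Hoaglin (1993).
    Returns (scores, indices of flagged instances)."""
    med = statistics.median(durations)
    # Median Absolute Deviation (MAD): median of |x_i - median|.
    mad = statistics.median([abs(x - med) for x in durations])
    if mad == 0:
        return [0.0] * len(durations), []  # all values identical
    scores = [0.6745 * (x - med) / mad for x in durations]
    return scores, [i for i, m in enumerate(scores) if abs(m) > threshold]

# Nine ordinary instance durations (ms) and one slow outlier:
times = [10.1, 10.3, 9.8, 10.0, 10.2, 9.9, 10.4, 10.0, 10.1, 42.0]
_, flagged = slow_instances(times)
print(flagged)  # [9]
```

Because both the median and the MAD are robust to extreme values, this test tolerates a few very slow instances without inflating the baseline, unlike the classical mean/standard-deviation Z-score.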

The rest of the paper is organized as follows. Section 2 presents background on HPC and sequence segmentation, followed by related studies on techniques for the analysis of MPI programs in Section 3. Section 4 details the proposed approach. In Section 5, we apply our approach to several traces generated from HPC systems and show how it can detect patterns of inefficient behavior. We conclude the paper in Section 6 and discuss future directions.

Section snippets

Background

This section starts by providing a more detailed view of HPC and communication patterns. Then, it presents the sequence segmentation technique that we use in our study for identifying computational phases.

Related work

Several tools have been developed for the visualization of MPI traces to facilitate program comprehension and system analysis tasks (ZIH, 2022, Shende and Malony, 2006). Although trace visualization tools capture details regarding the whole execution trace, it is difficult to analyze and comprehend the program execution by mere reliance on visualization techniques. For example, Fig. 4 shows a zoomed-in view of a trace of 16 processes using the Vampir (ZIH, 2022) visualization tool. This typical…

The approach

Fig. 5 shows our overall approach for detecting and locating inefficient communication patterns in MPI traces. We start by extracting the communication patterns by identifying the repeated sequences of MPI calls in each process trace. The output of this step is a sequence of all instances of the detected communication patterns in the trace, ordered using the happened-before relationship. Second, we locate the communication patterns within specific trace segments using information theory…
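The information-theoretic part of the segmentation relies on the Jensen–Shannon divergence (Grosse et al., 2002) between the pattern-frequency distributions of adjacent trace regions. A minimal sketch, assuming equal weights and base-2 logarithms (the exact weighting used in the paper may differ):

```python
import math

def entropy(p):
    """Shannon entropy (base 2) of a probability distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def jsd(p, q):
    """Jensen-Shannon divergence with equal weights:
    JSD(P, Q) = H((P+Q)/2) - (H(P) + H(Q)) / 2.
    0 for identical distributions, 1 for disjoint ones (base 2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2

# Two trace regions with identical pattern frequencies diverge by 0;
# regions dominated by entirely different patterns diverge by 1.
print(jsd([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(jsd([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

A segmentation procedure can then place phase boundaries where the divergence between consecutive windows of the event sequence peaks.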

Evaluation

We tested our approach on five traces generated from three HPC systems: SMG2000 (Brown et al., 2000), AMG2013 (ASC, 2013), and the NAS BT parallel benchmark (NAS, 1994).

To generate traces, we instrumented the applications statically using the Score-P (VI-HPS, 2022) tool, which is perhaps one of the most recommended instrumentation tools for MPI-based systems. We instrumented all the functions of a system in the same way to ensure that the added overhead, though it is known to be low with…

Conclusion

We presented a novel approach for the discovery of slow communication patterns in execution traces using statistical analysis techniques. Our approach can also be used for program comprehension to help analysts understand the behavior of the inter-process communication in MPI programs. Our approach is built on an improved version of our previous trace segmentation approach, which uses information theory principles to split a trace into execution phases. The approach relies on the detection of…
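As a rough illustration of the AHP categorization step, the priority vector of a pairwise-comparison matrix (Saaty, 1990) can be approximated by power iteration on its principal eigenvector. The comparison values below are hypothetical judgments, not taken from the paper:

```python
def ahp_weights(matrix, iters=100):
    """Priority vector of an AHP pairwise-comparison matrix,
    approximated by power iteration on the principal eigenvector."""
    n = len(matrix)
    w = [1.0 / n] * n
    for _ in range(iters):
        v = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        s = sum(v)
        w = [x / s for x in v]  # normalize so the weights sum to 1
    return w

# Hypothetical judgment: severity is 3x as important as complexity
# when ranking slow patterns (reciprocal fills the lower triangle).
m = [[1, 3],
     [1 / 3, 1]]
print([round(x, 2) for x in ahp_weights(m)])  # [0.75, 0.25]
```

With the criteria weights in hand, each slow pattern's severity and complexity scores can be combined into a single priority used to rank the patterns for the analyst.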

Reproduction package

The implementation of our approach as well as the data used in the evaluation section are made available in the following repository: https://github.com/lalawneh/HPC-MPI-Traces.

CRediT authorship contribution statement

Luay Alawneh: Conceptualization, Methodology, Literature review, Trace generation, Models and techniques, Implementation and experiments, Validation, Writing – review & editing. Abdelwahab Hamou-Lhadj: Conceptualization, Methodology, Validation, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to thank the Deanship of Research at Jordan University of Science & Technology for funding this research (ID. 20200285). Also, the first author, Dr. Luay Alawneh, would like to thank Concordia University, Canada, where this research was conducted during his sabbatical leave.


References (64)

  • Tu, C.A., et al., 2020. Investigating solutions for the development of a green bond market: Evidence from analytic hierarchy process. Finance Res. Lett.
  • Abraham, I., et al. Communication complexity of byzantine agreement, revisited.
  • Aguilar, X., et al. Automatic on-line detection of MPI application structure with event flow graphs.
  • ASC, 2000. Advanced simulation and computing program: The ASC SMG 2000 benchmark code. URL...
  • ASC, 2013. Parallel algebraic multigrid solver for linear systems: The ASC AMG 2013 benchmark code. URL...
  • Böhme, D., et al., 2016. Identifying the root causes of wait states in large-scale parallel applications. ACM Trans. Parallel Comput. (TOPC).
  • Böhme, D., et al. Scalable critical-path based performance analysis.
  • Brown, P.N., et al., 2000. Semicoarsening multigrid on distributed memory machines. SIAM J. Sci. Comput.
  • Casas, M., et al., 2007. Automatic phase detection of MPI applications. Parallel Comput. Archit. Algorithms Appl.
  • Casas, M., et al., 2010. Automatic phase detection and structure extraction of MPI applications. Int. J. High Perform. Comput. Appl.
  • Cheong, S.-A., et al., 2009. The context sensitivity problem in biological sequence segmentation.
  • Chetsa, G.L.T., et al. A user friendly phase detection methodology for HPC systems’ analysis.
  • Dabbagh, M., et al., 2016. Functional and non-functional requirements prioritization: empirical evaluation of IPA, AHP-based, and HAM-based approaches. Soft Comput.
  • Darema, F. The SPMD model: Past, present and future.
  • Eriksson, J., et al., 2016. Profiling and tracing tools for performance analysis of large scale applications. PRACE: Partnersh. Adv. Comput. Europe.
  • Eschweiler, D., et al. Open trace format 2: The next generation of scalable trace formats and support libraries.
  • Gallardo, E., et al., 2018. Employing MPI_T in MPI advisor to optimize application performance. Int. J. High Perform. Comput. Appl.
  • Geimer, M., et al., 2010. The Scalasca performance toolset architecture. Concurr. Comput.: Pract. Exper.
  • Gonzalez, J., et al. Automatic detection of parallel applications computation phases.
  • Gonzalez, J., et al. Automatic refinement of parallel applications structure detection.
  • Grosse, I., et al., 2002. Analysis of symbolic sequences using the Jensen–Shannon divergence. Phys. Rev. E.
  • Gusfield, D., 1997. Algorithms on strings, trees, and sequences: Computer science and computational biology. ACM SIGACT News.

    Dr. Luay Alawneh is an associate professor in the Department of Software Engineering at Jordan University of Science and Technology, Irbid, Jordan. His research interests are in software engineering, software maintenance and evolution, parallel processing, high performance computing systems, machine learning, and deep learning. Luay received his Ph.D. in electrical and computer engineering from Concordia University in Canada. In addition to his research achievements, Luay possesses intensive industrial experience in software engineering and software development from North American firms.

    Dr. Abdelwahab Hamou-Lhadj is a Professor in the Department of ECE at the Gina Cody School of Engineering and Computer Science, Concordia University, Montreal, Canada. His research interests are in software engineering, AI for IT operations, software tracing and logging, system observability, and model-driven engineering. He received his Ph.D. from the University of Ottawa, Canada. He is a senior member of IEEE, a long-lasting member of ACM, and a professional engineer with OIQ. He is also a frequent contributor to the Object Management Group (OMG) certification programs, OCUP 2 and OCEB 2.

    Editor: Dr Earl Barr.
