SIMD stealing: Architectural support for efficient data parallel execution on multicores

https://doi.org/10.1016/j.micpro.2018.12.001

Abstract

Single-instruction multiple-data (SIMD) architecture is a promising and widely used avenue for enhancing performance. Most current multicore systems adopt this technique, equipping each core with an SIMD engine. However, these SIMD engines are frequently underutilized in single-thread applications. A typical reason is that only one SIMD engine can be used by a single thread, even when other SIMD engines are idle and abundant data-level parallelism (DLP) remains to be exploited. To address this problem, we propose SIMD stealing, an architectural support for multicore systems that provides the capability of dynamically adjusting the number of SIMD engines used during application execution. The approach comprises a hardware modification and a compilation extension, and it can improve SIMD efficiency significantly for DLP applications. Experimental results show that SIMD stealing achieves an energy-delay product (EDP) reduction of approximately 53% for kernels and 34% for applications on average, with a small area overhead.

Introduction

Current multicore systems are efficient for handling applications in which coarse-grain thread-level parallelism (TLP) is abundant. These systems can run each thread on a physical core and complete a large amount of work in a short period. However, mapping applications that lack coarse-grained TLP, such as single-thread applications, onto multiple cores is inefficient. Nevertheless, this type of application remains an important workload in the multicore era [1]. Many of these applications have an abundance of data-level parallelism (DLP), which can be effectively exploited by a single-instruction multiple-data (SIMD) architecture [2], [3]. Fig. 1 illustrates the general structure of a current multicore architecture, in which each processor core is equipped with an SIMD engine. Examples include IBM’s AltiVec [2], Intel’s AVX [4], ARM’s Neon, and the recent SVE [5], [6]. The utilization efficiency of the processor cores and the memory subsystem can be improved significantly by leveraging DLP.

Prior work has found that mapping parallelism onto SIMD engines is more efficient than a multicore approach in terms of performance and power, given inter-core communication and memory/cache overheads [3]. The utilization of SIMD engines has long been a significant issue [7], [8], and extensive research has been conducted to improve them [3], [4], [9], [10], [11], [12], [13], [14]. Presently, many of these problems have been mitigated through hardware and software improvements, resulting in reasonable performance gains [6], [15]. However, the width of an SIMD engine (typically less than eight words) is inadequate to fully exploit DLP. This restriction limits the ability of a multicore architecture to boost the performance of single-thread applications. Wide SIMD engines, such as Intel’s 256-bit AVX [4] and the recent 512-bit extension AVX-512 [15], incur a significant area overhead for each core; such engines are especially costly for embedded or mobile multicore systems with a modest hardware budget. Furthermore, providing a wide SIMD engine for each core can waste resources for applications with limited DLP, exposing the contradiction of using a fixed SIMD width to satisfy varying application requirements.

ARM introduced the Scalable Vector Extension (SVE) to address the SIMD width problem [6]. SVE supports SIMD widths ranging from 128 to 2048 bits in increments of 128 bits, and hand-coded SVE assembly or C intrinsics need not be recompiled or rewritten when running on processors with different SIMD widths. Similarly, the RISC-V instruction set architecture introduces a vector extension that supports variable-length vectors, including vector lengths that vary at run time [16]. However, both SVE and the RISC-V vector extension still face the utilization problem of multiple SIMD engines: the maximum SIMD width is fixed for each concrete processor implementation, and only one SIMD engine can be used by a single-thread DLP application. Given the multiple SIMD engines available in a multicore system, we must determine whether they can be used together to execute single-thread DLP applications efficiently without changing their original architectural functions.
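As a concrete illustration of vector-length-agnostic code (a minimal sketch written for this discussion, not taken from the paper), the following loop uses ARM SVE C intrinsics. The same binary adapts to whatever vector width the hardware provides because the element count and loop predicate are obtained at run time; it assumes an SVE-capable compiler and target.

    #include <arm_sve.h>

    /* Vector-length-agnostic addition: c[i] = a[i] + b[i].
       The loop works unchanged for any SVE width from 128 to 2048 bits. */
    void vadd(float *c, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i += (int)svcntw()) {      /* svcntw(): 32-bit lanes per vector */
            svbool_t    pg = svwhilelt_b32_s32(i, n);      /* predicate masks the loop tail */
            svfloat32_t va = svld1_f32(pg, &a[i]);
            svfloat32_t vb = svld1_f32(pg, &b[i]);
            svst1_f32(pg, &c[i], svadd_f32_m(pg, va, vb));
        }
    }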

This work proposes SIMD stealing to address the abovementioned problem. It is a simple architectural modification to a multicore system that can boost the performance of data-parallel applications when their DLP is abundant and the SIMD engines in other cores are idle. SIMD stealing provides the capability of dynamically adjusting the number of SIMD engines used during execution. It adds a computation stealing unit (CSU) to each SIMD engine and extends the SIMD compiler to use it. The CSU detects when the SIMD engines in other cores are idle and transmits SIMD instructions to those engines for execution. Our experimental results show that SIMD stealing provides significant benefits for DLP applications with a small area overhead.
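The CSU itself is a hardware unit, so the following C fragment is only a hypothetical software analogy of the dispatch decision it makes, sketched here to clarify the idea; all names and structures are illustrative assumptions, not the paper's design.

    #include <stdbool.h>

    /* Hypothetical model: one record per core's SIMD engine. */
    typedef struct {
        int  core_id;
        bool simd_idle;   /* true when the engine has no pending SIMD work */
    } simd_engine_t;

    /* Issue one SIMD operation from core 'self'. In stealing mode, prefer an
       idle remote engine; otherwise fall back to the local engine.
       (In hardware, this check and forwarding happen per SIMD instruction.) */
    void csu_dispatch(simd_engine_t engines[], int num_cores, int self,
                      bool stealing_mode, void (*simd_op)(int core_id))
    {
        if (stealing_mode) {
            for (int c = 0; c < num_cores; c++) {
                if (c != self && engines[c].simd_idle) {
                    simd_op(c);       /* forward the instruction to the idle engine */
                    return;
                }
            }
        }
        simd_op(self);                /* local execution when nothing can be stolen */
    }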

This work offers the following contributions:

  • The “stealing” technique is introduced to dynamically adjust the number of SIMD engines during execution on a multicore architecture, thereby achieving efficient SIMD execution.

  • We present detailed architectural support for SIMD stealing, including the hardware modifications and compiler extension that facilitate it.

  • We report experimental results on DLP kernels and application benchmarks that show the effectiveness of SIMD stealing.

This paper is organized as follows. Section 2 introduces the motivation for this work. The detailed organization of SIMD stealing is described in Section 3. Section 4 provides the evaluation results. Section 5 discusses related work, and Section 6 presents the conclusions drawn from this work.

Section snippets

Motivation

To improve the performance of applications without coarse-grain TLP on multicore systems, an important solution is to exploit loop-level parallelism, in which the iterations of a loop can be executed in parallel. Fig. 2(a) depicts an example loop with 32 iterations on a 4-core system, where each core has a 128-bit SIMD engine. A loop can execute in parallel without synchronization when its iterations are identified as independent of one
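To make the arithmetic concrete, the fragment below is a hypothetical loop matching the description of Fig. 2(a) (it is not the paper's source code): with 32 independent iterations and 128-bit engines operating on four 32-bit elements per instruction, a single SIMD engine needs eight vector iterations, whereas four cooperating engines would need only two.

    #define N 32   /* independent iterations, as in the Fig. 2(a) example */

    /* Each iteration is independent, so all 32 can proceed in parallel. */
    void scale(float *a, const float *b)
    {
        for (int i = 0; i < N; i++)
            a[i] = b[i] * 2.0f;
    }

    /* A 128-bit SIMD engine packs 4 x 32-bit floats per instruction:
         32 / 4        = 8 vector iterations on one engine
         32 / (4 * 4)  = 2 vector iterations if all four engines cooperate */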

SIMD stealing

The basic idea of SIMD stealing is that, when an SIMD engine finds itself running in a specially designed stealing mode, it greedily “steals” the idle SIMD engines of other cores to execute its computations. Such stealing differs from work stealing, which is well known in parallel computing [19]: work stealing steals work from a busy thread, whereas SIMD stealing steals SIMD resources from an idle thread, so the stolen object differs between the two methods. The following text

Methodology

The evaluation is performed using an AltiVec-like SIMD engine, which represents a typical SIMD design [2]. We use the gem5 architecture-level simulator [24] integrated with the McPAT tool [25] as the baseline simulator for performance and power modeling of the target multicore system. gem5 is used to obtain application performance, and McPAT is used to obtain power consumption. We configure four cores with a shared bus, and each core is

Related works

SIMD engines are popular for increasing the performance and efficiency of microprocessor designs [2], [4], [5], [15]. However, they have several serious utilization problems [14]. Many hardware and software solutions have been proposed to reduce the overheads of SIMD engines. These solutions can be classified into three types based on the overhead source, namely, memory alignment [27], data reorganization [14], and control flow [12]. Stream programming is a promising method for efficiently

Conclusion and future work

This work proposes SIMD stealing, a simple architectural modification to the multicore system that improves the utilization efficiency of multiple SIMD engines. The SIMD stealing technique provides the capability of adjusting the number of SIMD engines dynamically during execution. It combines the independent SIMD units of different cores into one large SIMD unit when several cores temporarily do not require their SIMD units (e.g., when they are idle).

Acknowledgments

This work was supported by HGJ of China (under Grant 2018ZX01029103), the NSF of China (under Grants 61433019, 61872374, 61472435, 61572508, 61672526, 61202129, and U14352217), Young Elite Scientists Sponsorship Program By CAST (under Grant YESS-20150090), Research Project of NUDT (under Grant ZK17-03-06), and Science and Technology Innovation project of Hunan Province (under Grant 2018RS3083).

References (44)

  • N. Binkert et al., The gem5 simulator, SIGARCH Comput. Archit. News (2011).

  • H. Zhong et al., Extending multicore architectures to exploit hybrid parallelism in single-thread applications, HPCA '07, USA (2007).

  • IBM, PowerPC microprocessor family: vector/SIMD multimedia extension technology programming environments manual (2005).

  • A.H. Hormati et al., MacroSS: macro-SIMDization of streaming applications, ASPLOS '10 (2010).

  • Intel, Intel AVX: new frontiers in performance improvements and energy efficiency, Intel White Paper (2008).

  • ARM, Neon technology, http://www.arm.com/products/CPUs/NEON.html.

  • N. Stephens, ARMv8-A next generation vector architecture for HPC (2016).

  • N. Slingerland et al., Measuring the performance of multimedia instruction sets, IEEE Trans. Comput. (2002).

  • S.K. Raman et al., Implementing streaming SIMD extensions on the Pentium III processor, IEEE Micro (2000).

  • N. Clark et al., Liquid SIMD: abstracting SIMD hardware using lightweight dynamic mapping, HPCA '07 (2007).

  • G. Ren et al., Optimizing data permutations for SIMD devices, PLDI '06 (2006).

  • J. Shin et al., Superword-level parallelism in the presence of control flow (2005).

  • J. Shin, Introducing control flow into vectorized code (2007).

  • FSF, Auto-vectorization in GCC, http://gcc.gnu.org/projects/tree-ssa/vectorization.html.

  • L. Huang et al., SIF: overcoming the limitations of SIMD devices via implicit permutation, HPCA '10, USA (2010).

  • J. Reinders, AVX-512 instructions, Developer Zone (2015).

  • A. Waterman et al., The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Document Version 2.2 (2017).

  • C. Madriles et al., Boosting single-thread performance in multi-core systems through fine-grain multi-threading, ISCA '09 (2009).

  • OpenMP, OpenMP application program interface version 4.0, http://openmp.org/wp/openmp-specifications/.

  • R.D. Blumofe et al., Scheduling multithreaded computations by work stealing, J. ACM (1999).

  • L. Huang et al., VBON: towards efficient on-chip networks via hierarchical virtual bus (2012).

  • D.F. Bacon et al., Compiler transformations for high-performance computing, ACM Comput. Surv. (1994).

    Libo Huang received the B.S. and Ph.D. degrees in computer engineering from the National University of Defense Technology, China, in 2005 and 2010, respectively. He is an associate professor at the School of Computer, National University of Defense Technology. His research interests include computer architecture, hardware/software co-design, VLSI design, and on-chip communication. He has authored more than 50 papers in internationally recognized journals and conferences.

    Yashuai Lü received the Ph.D. degree from the National University of Defense Technology in 2009. His major field of study is computer architecture. He now works at the Space Engineering University, China. His main research interests include processor architecture and computer graphics. He has authored more than 20 papers in internationally recognized journals and conferences.

    Sheng Ma received the B.S. and Ph.D. degrees in computer science and technology from the National University of Defense Technology (NUDT) in 2007 and 2012, respectively. He visited the University of Toronto from Sept. 2010 to Sept. 2012. He is currently an Assistant Professor at the School of Computer, NUDT. His research interests include on-chip networks, SIMD architectures, and arithmetic unit designs. He has authored more than 30 papers in internationally recognized journals and conferences.

    Nong Xiao (M’04) received the Ph.D. degree in computer science and technology from the National University of Defense Technology, P.R. China, in 1996. Since 2004, he has been a Professor with the Department of Computer Science. He is also a senior member of the National Key Laboratory for Parallel and Distributed Processing of China. His research interests include computer architecture, embedded systems, grid computing, and large-scale storage. He became a Member (M) of IEEE and ACM in 2004. Prof. Xiao has contributed 4 invited chapters to book volumes and published more than 100 papers in archival journals and refereed conference proceedings. He has developed many grid computing products, e.g., the monitor system of China Grid.

    Zhiying Wang received the Ph.D. degree in electrical engineering from the National University of Defense Technology, P.R. China, in 1989. He is currently the Deputy Dean and a Professor of computer engineering with the Department of Computer, National University of Defense Technology, Hunan, China. He has contributed 10 invited chapters to book volumes, published 200 papers in archival journals and refereed conference proceedings, and delivered over 30 keynotes. His current research projects include asynchronous microprocessor design, nanotechnology circuits and systems based on optoelectronic technology, and virtual computer systems. Prof. Wang became a Member (M) of IEEE and ACM in 2002 and 2003, respectively. His main research fields include computer architecture, computer security, VLSI design, reliable architecture, multi-core memory systems, and asynchronous circuits.
