SIMD stealing: Architectural support for efficient data parallel execution on multicores
Introduction
Current multicore systems are efficient for applications with abundant coarse-grain thread-level parallelism (TLP). Such systems execute each thread on a physical core and can complete a large amount of work in a short period. However, mapping applications that lack coarse-grain TLP, such as single-thread applications, onto multiple cores is inefficient, even though this type of application remains an important workload in the multicore era [1]. Many of these applications contain abundant data-level parallelism (DLP) that can be effectively exploited by a single-instruction multiple-data (SIMD) architecture [2], [3]. Fig. 1 illustrates the general structure of a current multicore architecture. Each processor core in the multicore system is equipped with an SIMD engine; examples include IBM’s AltiVec [2], Intel’s AVX [4], ARM’s Neon, and the recent SVE [5], [6]. Leveraging DLP can significantly improve the utilization efficiency of the processor cores and the memory subsystem.
Prior works have found that mapping parallelism onto SIMD engines is more efficient than a multicore approach in terms of both performance and power, given inter-core communication and high memory/cache overheads [3]. The utilization of an SIMD engine has long been a significant issue [7], [8], and for decades extensive research has been conducted to improve it [3], [4], [9], [10], [11], [12], [13], [14]. Many of these problems have now been solved through hardware and software improvements, yielding reasonable performance gains [6], [15]. However, the width of an SIMD engine (typically less than eight words) is inadequate to fully exploit DLP. This restriction prevents the multicore architecture from boosting the performance of single-thread applications. Wide SIMD engines, such as Intel’s 256-bit AVX [4] and the recent 512-bit extension AVX-512 [15], incur a significant area overhead for each core, and they are especially costly for embedded or mobile multicore systems with a modest hardware budget. Furthermore, providing a wide SIMD engine in each core wastes resources for applications with limited DLP. These observations expose the fundamental tension in using a fixed SIMD width to satisfy varying application requirements.
ARM introduced the scalable vector extension (SVE) to address the SIMD width problem [6]. SVE supports SIMD widths ranging from 128 bits to 2048 bits, in increments of 128 bits, and hand-coded SVE assembly or C intrinsics need not be rewritten or recompiled to run on a processor with a different SIMD width. Similarly, the RISC-V instruction set architecture introduces a vector extension that supports variable-length vectors [16] and likewise allows the vector length to vary at run time. However, SVE and the RISC-V vector extension still suffer from a utilization efficiency problem with multiple SIMD engines: the maximum width of the SIMD engine is fixed for each concrete processor implementation, and only one SIMD engine can serve a single-thread DLP application. Given the multiple SIMD engines in a multicore system, the question is whether they can be used together to execute single-thread DLP applications efficiently without changing their original architectural functions.
This work proposes SIMD stealing to address the aforementioned problem. It is a simple architectural modification to a multicore system that boosts the performance of data-parallel applications when their DLP is abundant and the SIMD engines in other cores are idle. SIMD stealing provides the capability of dynamically adjusting the number of SIMD engines used during execution. It adds a computation stealing unit (CSU) to each SIMD engine and extends the SIMD compiler to use it. The CSU detects when the SIMD engines in other cores are idle and transmits SIMD instructions to those engines for execution. Our experimental results show that SIMD stealing provides significant benefits for DLP applications at a small area overhead.
This work offers the following contributions:
- The “stealing” technique is introduced to dynamically adjust the number of SIMD engines during execution on a multicore architecture, thereby achieving efficient SIMD execution.
- We present detailed architectural support for SIMD stealing, including the hardware modifications and compiler extensions that facilitate it.
- We report experimental data for DLP kernels and application benchmarks that show the effectiveness of SIMD stealing.
This paper is organized as follows. In Section 2, we introduce the motivation for this work. The detailed organization of SIMD stealing is described in Section 3. Section 4 provides the evaluation results. Section 5 discusses related works, and Section 6 presents the conclusion drawn from this work.
Motivation
To improve the performance of applications without coarse-grain TLP on multicore systems, an important approach is to exploit loop-level parallelism, in which the iterations of a loop can be executed in parallel. Fig. 2(a) depicts an example loop with 32 iterations on a 4-core system, where each core has a 128-bit SIMD engine. The iterations can execute in parallel without synchronization when they are identified as independent of one
SIMD stealing
The basic idea of SIMD stealing is that, when an SIMD engine finds itself running in a specially designed stealing mode, it greedily “steals” the idle SIMD engines in other cores to execute its computations. Such stealing differs from work stealing, which is well-known in parallel computing [19]: work stealing steals work from a busy thread, whereas SIMD stealing steals SIMD resources from an idle thread, so the stolen object differs between the two methods. The following text
Methodology
The evaluation is performed using an AltiVec-like SIMD engine, which represents a typical SIMD design [2]. To evaluate SIMD stealing, we use the gem5 architecture-level simulator [24] integrated with the McPAT tool [25] as the baseline for performance and power modeling of the target multicore system: gem5 provides the application performance, and McPAT provides the power consumption. We configure four cores with a shared bus, and each core is
Related works
SIMD engines are popular for increasing the performance and efficiency of microprocessor designs [2], [4], [5], [15]. However, they have several serious utilization problems [14]. Many hardware and software solutions have been proposed to reduce the overheads of SIMD engines; these solutions can be classified into three types based on the overhead source, namely, memory alignment [27], data reorganization [14], and control flow [12]. Stream programming is a promising method for efficiently
Conclusion and future work
This work proposes SIMD stealing, a simple architectural modification to a multicore system that improves the utilization efficiency of its multiple SIMD engines. The SIMD stealing technique dynamically adjusts the number of SIMD engines used during execution: it combines the independent SIMD units of different cores into one large SIMD unit when some of the cores temporarily do not need their SIMD units (e.g., when they are idle).
Acknowledgments
This work was supported by HGJ of China (under Grant 2018ZX01029103), the NSF of China (under Grants 61433019, 61872374, 61472435, 61572508, 61672526, 61202129, and U14352217), Young Elite Scientists Sponsorship Program By CAST (under Grant YESS-20150090), Research Project of NUDT (under Grant ZK17-03-06), and Science and Technology Innovation project of Hunan Province (under Grant 2018RS3083).
Libo Huang received the B.S. and Ph.D. degrees in computer engineering from the National University of Defense Technology, China, in 2005 and 2010, respectively. He is an associate professor at the School of Computer, National University of Defense Technology. His research interests include computer architecture, hardware/software co-design, VLSI design, and on-chip communication. He has authored more than 50 papers in internationally recognized journals and conferences.
References (44)
- et al., The gem5 simulator, SIGARCH Comput. Archit. News (2011)
- et al., Extending multicore architectures to exploit hybrid parallelism in single-thread applications, HPCA ’07, USA (2007)
- PowerPC Microprocessor Family: Vector/SIMD Multimedia Extension Technology Programming Environments Manual (2005)
- et al., MacroSS: macro-SIMDization of streaming applications, ASPLOS ’10 (2010)
- Intel AVX: new frontiers in performance improvements and energy efficiency, Intel White Pap. (2008)
- ARM, Neon technology, http://www.arm.com/products/CPUs/NEON.html,...
- ARMv8-A next generation vector architecture for HPC (2016)
- et al., Measuring the performance of multimedia instruction sets, IEEE Trans. Comput. (2002)
- et al., Implementing streaming SIMD extensions on the Pentium III processor, IEEE Micro (2000)
- et al., Liquid SIMD: abstracting SIMD hardware using lightweight dynamic mapping, HPCA ’07 (2007)
- Optimizing data permutations for SIMD devices, PLDI ’06
- Superword-level parallelism in the presence of control flow
- Introducing control flow into vectorized code
- SIF: overcoming the limitations of SIMD devices via implicit permutation, HPCA-16, USA
- AVX-512 instructions, Intel Developer Zone
- The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Document Version 2.2
- Boosting single-thread performance in multi-core systems through fine-grain multi-threading, ISCA ’09
- Scheduling multithreaded computations by work stealing, J. ACM
- VBON: Towards efficient on-chip networks via hierarchical virtual bus
- Compiler transformations for high-performance computing, ACM Comput. Surv.
Yashuai Lü received his PhD degree from National University of Defense Technology in 2009. His major field of study is computer architecture. Now he works at the Space Engineering University, China. His main research interests include processor architecture and computer graphics. He authored more than 20 papers in internationally recognized journals and conferences.
Sheng Ma received the B.S. and Ph.D. degrees in computer science and technology from the National University of Defense Technology (NUDT) in 2007 and 2012, respectively. He visited the University of Toronto from Sept. 2010 to Sept. 2012. He is currently an Assistant Professor of the School of Computer, NUDT. His research interests include on-chip networks, SIMD architectures and arithmetic unit designs. He authored more than 30 papers in internationally recognized journals and conferences.
Nong Xiao (M’04) received the Ph.D. degree in computer science and technology from the National University of Defense Technology, P.R. China in 1996. From 2004, he was a Professor with the Department of Computer Science. Now, he is also a senior Member with National Key Laboratory for Parallel and Distributed Processing of China. His research interests include Computer Architecture, Embedded System, Grid Computing and Large-scale Storage. He became a Member (M) of IEEE and ACM in 2004. Prof. XIAO has contributed 4 invited chapters to book volumes and published more than 100 papers in archival journals and refereed conference proceedings. He has developed many grid computing products, e.g. the monitor system of China Grid.
Zhiying Wang received the Ph.D. degree in electrical engineering from computer science and technology from the National University of Defense Technology, P.R. China, in 1989. He is currently the Deputy Dean and Professor of computer engineering with Department of Computer, National University of Defense Technology, Hunan, China. He has contributed 10 invited chapters to book volumes, published 200 papers in archival journals and refereed conference proceedings, and delivered over 30 keynotes. His current research projects include asynchronous microprocessor design, nanotechnology circuits and systems based on Optoelectronic technology and virtual computer system. Prof. Wang became a Member (M) of IEEE and ACM in 2002 and 2003 respectively. His main research fields include computer architecture, computer security, VLSI design, reliable architecture, multi-core memory system and asynchronous circuit.