Performance-asymmetry-aware scheduling for Chip Multiprocessors with static core coupling

https://doi.org/10.1016/j.sysarc.2010.09.003

Abstract

Thread-level redundancy is an efficient approach for transient fault detection and recovery in Chip Multiprocessors (CMPs), in which two adjacent cores are statically coupled to form a functional Dual Modular Redundancy (DMR) unit. Manufacturing process variations cause core-to-core (C2C) performance asymmetry across the chip, which can be further divided into asymmetry among core-pairs and asymmetry within a core-pair. We call these inter- and intra-pair asymmetries, respectively; both should be taken into consideration in application scheduling for CMPs with static core coupling. In this paper, we first formulate this scheduling problem as a 0–1 programming problem that maximizes the system Weighted Throughput. An efficient IVF&AppSen algorithm is then proposed, which we prove to be optimal when the number of applications equals the number of core-pairs. We also adapt the Simulated Annealing technique to tackle this problem when there are fewer applications than core-pairs on chip. Simulations on a 64-core CMP show that the proposed algorithms achieve a 2.5–9.3% improvement in Weighted Throughput compared to the prior VarF&AppIPC algorithm.

Introduction

Transient faults, also called soft errors, represent a critical reliability challenge for current and future Integrated Circuits (ICs). Soft errors occur when energetic particles strike and then invert the state of a device, such as a storage cell or a logic gate. The inversion may further propagate and result in an execution error or failure in application programs. Although advances in manufacturing technology slightly reduce the error rate of a single transistor, the exponential growth in the number and density of transistors on a single die again results in a considerably high error rate for the entire chip [21], [20], [13]. Therefore, it is essential to employ on-line fault detection and tolerance techniques to protect circuits and systems from soft errors.

A Chip Multiprocessor (CMP), which integrates multiple homogeneous cores on a single silicon die to improve performance through parallel execution, is generally regarded as the most promising architecture for future high-performance computing, with the benefits of power efficiency and short time-to-market. As the number of on-chip cores increases, CMPs are gradually migrating to many-core processors [22]. Because CMPs and many-core chips contain many processing elements, their inherent hardware redundancy can be exploited for reliability. Thread-level redundancy (TLR), which is widely adopted in multiprocessor systems [8], [9], prevails as one of the most efficient soft error detection and tolerance approaches in CMPs. The Operating System (OS) duplicates the execution of threads on separate cores to detect and recover from soft errors [23], [3], [10]. In chip-level thread redundant execution, there is always a slack of instructions between the two replicated threads, through which the leading thread can forward load values and branch targets to the trailing thread, thus accelerating the trailing thread. Since thread-level redundancy requires frequent communication and synchronization between the pair of cores running the leading and trailing threads, TLR designs in the literature typically couple two adjacent cores statically, with glue logic in between. As depicted in Fig. 1, communication channels and buffer queues bind two adjacent cores to support thread redundant execution. We call this the CMP with Static Core Coupling (CMP-SCC) architecture in this paper. CMP-SCC is very similar to the Paceline structure introduced in [4].

Process variation [2], [17], [18] is another important issue that cannot be ignored during the system and architecture design phases, as precise control of the manufacturing process becomes extremely difficult, if not impossible. For CMPs, within-die variation causes the frequency or leakage characteristics of individual cores to differ significantly from each other, even though the cores are architecturally homogeneous. According to Intel’s research [2], the maximum difference in core frequencies could be approximately 20% at 90 nm technology. In the presence of core-to-core variations in CMPs, prior research has focused on scheduling applications onto different cores to maximize system throughput [6] or to reduce power consumption [5].

However, core-to-core performance asymmetries pose new challenges for the application scheduling problem in the CMP-SCC architecture. In CMP-SCC, as shown in Fig. 1, process variation results in a performance gap within a core-pair. To run ahead, the leading thread should be dispatched to the higher-performance core, which we call the leading core, while the trailing thread is dispatched to the weaker core within the pair, which we call the trailing core. In this paper, the performance asymmetry between the leading and trailing cores within a pair is named intra-pair variation, while the asymmetry among leading cores in different core-pairs is named inter-pair variation. Prior solutions for variation-aware scheduling in CMPs considered only the inter-pair case and are therefore no longer applicable to CMP-SCC. In CMP-SCC, a thread is assigned to a core-pair, and the performance gap between the leading and trailing cores, i.e., the intra-pair variation, also affects the behavior of the application. Thus, the scheduling problem in CMP-SCC should take both inter- and intra-pair variations into account.
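The two kinds of asymmetry can be made concrete with a small sketch. It assumes cores are coupled in index order, (0, 1), (2, 3), and so on, and that each gap is normalized by the faster frequency; the function names and both normalizations are illustrative assumptions, not definitions taken from the paper.

```python
from typing import List, Tuple


def couple_adjacent_cores(freqs_ghz: List[float]) -> List[Tuple[float, float]]:
    """Pair cores (0,1), (2,3), ... and return (leading, trailing) frequencies.

    Assumption: the faster core of each pair acts as the leading core.
    """
    if len(freqs_ghz) % 2 != 0:
        raise ValueError("static coupling needs an even number of cores")
    pairs = []
    for i in range(0, len(freqs_ghz), 2):
        a, b = freqs_ghz[i], freqs_ghz[i + 1]
        pairs.append((max(a, b), min(a, b)))
    return pairs


def intra_pair_gap(pair: Tuple[float, float]) -> float:
    """Relative gap between leading and trailing core (assumed normalization)."""
    leading, trailing = pair
    return (leading - trailing) / leading


def inter_pair_spread(pairs: List[Tuple[float, float]]) -> float:
    """Relative spread among leading-core frequencies (assumed normalization)."""
    leads = [leading for leading, _ in pairs]
    return (max(leads) - min(leads)) / max(leads)


if __name__ == "__main__":
    pairs = couple_adjacent_cores([2.0, 1.6, 1.8, 2.2])
    for index, pair in enumerate(pairs):
        print(index, pair, round(intra_pair_gap(pair), 3))
    print("inter-pair spread:", round(inter_pair_spread(pairs), 3))
```

A scheduler that looks only at the leading frequencies sees the spread computed by `inter_pair_spread` and ignores the per-pair gap, which is the limitation the paragraph above attributes to prior work.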

Based on the above analysis, in this paper we first evaluate the impact of intra-pair performance asymmetry on SPEC2000 benchmarks. We observe that different applications manifest different sensitivities to intra-pair performance asymmetry. For example, some applications such as gzip are greatly affected by intra-pair variation, while others such as swim are not sensitive to it. We adopt a Weighted Throughput metric to evaluate variation-aware application scheduling in CMP-SCC, and then formalize the scheduling problem as a 0–1 programming problem. An efficient scheduling algorithm, called IVF&AppSen, is then proposed to tackle this problem. We prove that when the number of applications to be scheduled is equal to the number of core-pairs on chip, IVF&AppSen achieves optimal solutions. When the number of applications is less than that of core-pairs, however, the problem is NP-complete, and we adapt the Simulated Annealing technique for this case. Extensive simulation results on a 64-core CMP-SCC (32 pairs) show that the Weighted Throughput is improved by 2.5–9.3% compared to the VarF&AppIPC algorithm [5], which considers leading core frequencies only.
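For readers unfamiliar with Simulated Annealing, the following is a generic sketch of how it can search over assignments of applications to core-pairs when there are fewer applications than pairs. It is not the adapted algorithm of this paper: the move set, cooling schedule, parameter defaults, and the name `anneal_assignment` are all assumptions, and the objective is passed in as a function because the Weighted Throughput definition is not reproduced in this excerpt.

```python
import math
import random
from typing import Callable, List, Optional, Tuple


def anneal_assignment(
    num_apps: int,
    num_pairs: int,
    score: Callable[[List[int]], float],
    steps: int = 5000,
    start_temp: float = 1.0,
    cooling: float = 0.999,
    seed: Optional[int] = None,
) -> Tuple[List[int], float]:
    """Maximize score(assignment), where assignment[i] is the pair of app i."""
    if num_apps > num_pairs:
        raise ValueError("need at least as many core-pairs as applications")
    rng = random.Random(seed)
    # Start from a random assignment of distinct pairs.
    current = rng.sample(range(num_pairs), num_apps)
    current_score = score(current)
    best, best_score = list(current), current_score
    temp = start_temp
    for _ in range(steps):
        candidate = list(current)
        app = rng.randrange(num_apps)
        target = rng.randrange(num_pairs)
        if target in candidate:
            # Target pair is occupied: swap the two applications.
            other = candidate.index(target)
            candidate[app], candidate[other] = candidate[other], candidate[app]
        else:
            # Target pair is idle: move the application there.
            candidate[app] = target
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Metropolis rule: always accept improvements, sometimes accept losses.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = list(current), current_score
        temp = max(temp * cooling, 1e-9)
    return best, best_score


if __name__ == "__main__":
    # Hypothetical throughput table: rows are applications, columns are pairs.
    table = [[3.0, 1.0, 2.0, 0.5], [1.0, 4.0, 0.5, 2.0]]
    result = anneal_assignment(
        2, 4, lambda a: sum(table[i][p] for i, p in enumerate(a)), seed=0
    )
    print(result)
```

Both moves keep every application on a distinct pair, so each candidate is a feasible schedule and no repair step is needed.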

The rest of this paper is organized as follows: Section 2 reviews prior related work and motivates this paper. In Section 3, we analyze the impact of intra-pair performance asymmetry on redundant execution of SPEC2000 benchmarks. The variation-aware scheduling problem for CMP-SCC is then formalized in Section 4. Section 4 also describes the proposed scheduling algorithms. Simulation results are shown in Section 5. Section 6 concludes the paper.

Chip-level thread redundancy in CMPs

Traditional multiprocessor systems, e.g., IBM Z900 [9] and Compaq NonStop Himalaya [8], employ thread-level redundancy (TLR) [14], [3], [4], [15], [16] for high reliability and availability, in which the execution of the same instruction is checked on a clock-by-clock basis, i.e., in lockstep. The system error rate due to transient faults continues to increase as technology advances into the nanometer scale. All kinds of applications, not only highly reliable ones, require redundancy to ensure

The impact of intra-pair asymmetry on thread redundant execution

Fig. 2 illustrates the microarchitecture of CMP-SCC adopted in this paper, which is very similar to CRT in [11]. The memory access operations, i.e., load and store instructions, are all performed by the leading cores. The values loaded by the leading core from the memory hierarchy are forwarded to and stored in the LVQ (load value queue). When the trailing core needs the same data, it fetches it from the head of the LVQ. The performance gap and execution slack also enable the correct branch targets generated by leading
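The load value queue described in this section behaves like a bounded FIFO between the two cores. The sketch below models that behavior in software, under stated assumptions: the class and method names are hypothetical, the stall-on-full and wait-on-empty behavior is assumed, and the address check at the head is an added illustration of how a mismatch could be flagged rather than a detail given in this excerpt.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class LoadValueQueue:
    """Bounded FIFO carrying (address, value) pairs from leading to trailing core."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._entries: Deque[Tuple[int, int]] = deque()
        self._capacity = capacity

    def push_from_leading(self, address: int, value: int) -> bool:
        """Record a completed load; False means the queue is full (leading stalls)."""
        if len(self._entries) >= self._capacity:
            return False
        self._entries.append((address, value))
        return True

    def pop_for_trailing(self, address: int) -> Optional[int]:
        """Return the head value; None means the queue is empty (trailing waits)."""
        if not self._entries:
            return None
        head_address, value = self._entries[0]
        if head_address != address:
            # Assumed check: the threads disagree on the load address.
            raise RuntimeError("load address mismatch between redundant threads")
        self._entries.popleft()
        return value


if __name__ == "__main__":
    lvq = LoadValueQueue(capacity=2)
    lvq.push_from_leading(0x10, 7)
    lvq.push_from_leading(0x14, 9)
    print(lvq.pop_for_trailing(0x10), lvq.pop_for_trailing(0x14))
```

The capacity bounds how far the leading thread can run ahead, which is one way the slack between the two threads is kept finite.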

Problem formulation

Threads running on two statically coupled cores will have performance degradation when compared with single execution on the leading core. To take both inter- and intra-pair variation into consideration, we use throughput measured in millions of instructions per second (MIPS) to evaluate such degradation. Though the instruction count is doubled in TLR, we only consider the instructions executed in the leading core, since the trailing thread is functionally transparent from the view point of
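The formulation itself is not reproduced in this excerpt. As a hedged illustration only, a 0–1 program for assigning $N$ applications to $M \ge N$ core-pairs generally has the shape below. All symbols are assumptions introduced here: $x_{ij}$ is 1 when application $i$ runs on pair $j$, and $w_{ij}$ stands for the contribution of that placement to the Weighted Throughput, which in this setting would depend on both the leading-core frequency of pair $j$ and its intra-pair gap.

```latex
\begin{aligned}
\max_{x} \quad & \sum_{i=1}^{N} \sum_{j=1}^{M} w_{ij}\, x_{ij} \\
\text{s.t.} \quad & \sum_{j=1}^{M} x_{ij} = 1, \qquad i = 1, \dots, N
  && \text{(each application runs on exactly one pair)} \\
& \sum_{i=1}^{N} x_{ij} \le 1, \qquad j = 1, \dots, M
  && \text{(each pair hosts at most one application)} \\
& x_{ij} \in \{0, 1\}, \qquad \forall\, i, j
\end{aligned}
```

When $N = M$, the second constraint holds with equality for every pair. The paper's actual objective may differ from this additive form.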

Experiment I

In this section, we first evaluate the proposed IVF&AppSen scheduling algorithm on a 64-core CMP-SCC (i.e., 32 core-pairs). The 32 applications to be scheduled consist of the 9 SPEC2000 benchmark programs listed in Section 3. The core frequencies are randomly generated with 2 GHz as the expectation and 20% as the variation. Teodorescu and Torrellas proposed the VarF&AppIPC algorithm in [5], which maps the applications with the highest IPC onto the cores with the highest frequency, thus only considering inter-pair
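The experimental setup and the baseline rule described above can be sketched as follows. Several details are assumptions: "20% as the variation" is interpreted as a uniform range of ±20% around 2 GHz, the function names are hypothetical, the IPC numbers in the demo are made up for illustration, and the mapping is a plain sort-and-match reading of "highest IPC onto highest frequency" rather than the implementation from [5].

```python
import random
from typing import Dict, List, Optional, Sequence


def sample_core_frequencies(
    num_cores: int,
    mean_ghz: float = 2.0,
    variation: float = 0.20,
    seed: Optional[int] = None,
) -> List[float]:
    """Draw per-core frequencies uniformly within mean * (1 +/- variation).

    Assumption: the source does not state the distribution; uniform is a guess.
    """
    rng = random.Random(seed)
    low = mean_ghz * (1.0 - variation)
    high = mean_ghz * (1.0 + variation)
    return [rng.uniform(low, high) for _ in range(num_cores)]


def map_by_ipc_and_frequency(
    app_ipc: Dict[str, float], leading_freqs: Sequence[float]
) -> Dict[str, int]:
    """Baseline: the i-th highest-IPC app goes to the i-th fastest leading core."""
    if len(app_ipc) > len(leading_freqs):
        raise ValueError("more applications than core-pairs")
    apps = sorted(app_ipc, key=lambda name: app_ipc[name], reverse=True)
    pairs = sorted(
        range(len(leading_freqs)), key=lambda j: leading_freqs[j], reverse=True
    )
    return dict(zip(apps, pairs))


if __name__ == "__main__":
    freqs = sample_core_frequencies(64, seed=1)
    # The faster core of each adjacent pair is taken as the leading core.
    leading = [max(freqs[i], freqs[i + 1]) for i in range(0, 64, 2)]
    # Hypothetical IPC values, for illustration only.
    demo_ipc = {"gzip": 1.5, "swim": 0.8}
    print(map_by_ipc_and_frequency(demo_ipc, leading))
```

Because this rule reads only the leading-core frequencies, two pairs with equal leading frequency but different trailing cores are indistinguishable to it. Application instances must have unique names here, since 32 applications drawn from 9 benchmarks would otherwise collide as dictionary keys.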

Conclusion

One appealing aspect of Chip Multiprocessors is the inherent redundancy of hardware resources, which can be exploited for soft error detection and recovery. Chip-level thread redundancy is considered to be one such efficient approach. Manufacturing process variations cause core performance to differ significantly across a chip. For CMPs with static core coupling (CMP-SCC), such core-to-core performance asymmetry can be further divided into inter-pair and intra-pair asymmetries. The former is the

Jianbo Dong received his B.Eng. degree in Computer Science from Hebei University of Technology, Tianjin, China, in 2007, and is now a Ph.D. candidate in Computer Science at the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. His research interests include network-on-chip, design for reliability and multi-core/many-core processors.

References

  • The SESC Simulator....
  • E. Humenay, D. Tarjan, K. Skadron, Impact of parameter variations on multi-core chips, in: Proceedings of the Workshop...
  • S.S. Mukherjee, M. Kontz, S.K. Reinhardt, Detailed design and evaluation of redundant multithreading alternatives, in:...
  • B. Greskamp, J. Torrellas, Paceline: improving single-thread performance in nanoscale CMPs through core overclocking,...
  • R. Teodorescu, J. Torrellas, Variation-aware application scheduling and power management for chip multiprocessors, in:...
  • P. Ndai et al., Within-die variation-aware scheduling in superscalar processors for improved throughput, IEEE Transactions on Computers (2008)
  • N. Lakshminarayana, S. Rao, H. Kim, Asymmetry aware scheduling algorithms for asymmetric multiprocessor, in: Workshop...
  • Compaq Computer Corporation, Data Integrity for Compaq Non-Stop Himalaya Servers, 1999....
  • T.J. Slegel et al., IBM’s S/390 G5 microprocessor design, in: Proceedings of the Annual IEEE/ACM International...
  • C. LaFrieda, E. Ipek, J. Martinez, R. Manohar, Utilizing dynamically coupled cores to form a resilient chip...
  • M. Gomaa, C. Scarbrough, T.N. Vijaykumar, I. Pomeranz, Transient-fault recovery for chip multiprocessors, in:...
  • K. Srinivasan et al., Integer linear programming and heuristic techniques for system-level low power scheduling on multiprocessor architectures under throughput constraints, Integration, the VLSI Journal (2007)
    Lei Zhang received his B.Eng. degree in Computer Science from University of Electronic Science and Technology of China (UESTC), Sichuan, China, in 2003, and the Ph.D. degree in Computer Science from Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2008.

    He is currently an Assistant Professor at the Institute of Computing Technology, Chinese Academy of Sciences. His research interests include network-on-chip, design for reliability and multi-core/many-core processors.

    Yinhe Han (M’06) received the B.Eng. degree from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2001, and the M.Eng. and Ph.D. degrees in Computer Science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2003 and 2006, respectively.

    He is currently an Associate Professor at the Institute of Computing Technology, Chinese Academy of Sciences. His research interests include VLSI design and test, and reliable architecture design. Dr. Han was a recipient of the Test Technology Technical Council Best Paper Award at the Asian Test Symposium 2003. He is a member of IEEE/ACM/CCF/IEICE.

    Guihai Yan received his B.Sc. from Peking University in 2005. Mr. Yan is now a Ph.D. candidate in Computer Science at the Institute of Computing Technology (ICT), Chinese Academy of Sciences. His research interests include ASIC design and computer microarchitecture, with emphasis on design for reliability, variation tolerance, and low-power circuits. He is a student member of IEEE.

    Xiaowei Li (SM’04) received his B.Eng. and M.Eng. degrees in Computer Science from Hefei University of Technology, China, in 1985 and 1988, respectively, and the Ph.D. degree in Computer Science from the Institute of Computing Technology, Chinese Academy of Sciences, in 1991.

    From 1993 to 2000, he was an Associate Professor in the Department of Computer Science, Peking University, China. During 1997 and 1998, he was a Visiting Research Fellow in the Department of Electrical and Electronic Engineering, University of Hong Kong. During 1999 and 2000, he was a Visiting Professor in the Graduate School of Information Science, Nara Institute of Science and Technology, Japan. He joined the Institute of Computing Technology, Chinese Academy of Sciences as a Professor in 2000. His research interests include VLSI testing, design verification, and dependable computing. He serves as a member of the Editorial Board of the Journal of Computer Science and Technology and the Journal of Low Power Electronics, and as an Associate Editor-in-Chief of the Journal of Computer-Aided Design and Computer Graphics (in Chinese).

    He was a Technical Program Chair of the IEEE Asian Test Symposium (ATS) in 2003 and of the Workshop of RTL and High Level Testing (WRTLT) in 2001. He was a General Chair of ATS in 2007 and of WRTLT in 2003. He serves on the Technical Program Committee of several IEEE and ACM conferences, including VTS, DATE, ASP-DAC and PRDC. He also serves as Asia-Pacific Regional TTTC Vice-Chair.

    The work was supported in part by National Basic Research Program of China (973) under Grant No. 2011CB302503, in part by National Natural Science Foundation of China (NSFC) under Grant Nos. 60806014, 60831160526, 60633060, 60921002, 61076037, 60906018, and in part by Hi-Tech Research and Development Program of China (863) under Grant No. 2009AA01Z126.
