ABSTRACT
Recent research advocates asymmetric multi-core architectures, where cores in the same processor can have different performance. These architectures support single-threaded performance and multithreaded throughput at lower costs (e.g., die size and power). However, they also pose unique challenges to operating systems, which traditionally assume homogeneous hardware. This paper presents AMPS, an operating system scheduler that efficiently supports both SMP- and NUMA-style performance-asymmetric architectures. AMPS contains three components: asymmetry-aware load balancing, faster-core-first scheduling, and NUMA-aware migration. We have implemented AMPS in Linux kernel 2.6.16 and used CPU clock modulation to emulate performance asymmetry on an SMP system and a NUMA system. For various workloads, we show that AMPS achieves a median speedup of 1.16 with a maximum of 1.44 over stock Linux on the SMP system, and a median of 1.07 with a maximum of 2.61 on the NUMA system. Our results also show that AMPS improves fairness and repeatability of application performance measurements.
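As a rough illustration of what asymmetry-aware load balancing could mean, the sketch below places threads so that each core's run-queue length stays proportional to its relative speed. This is a toy user-space model under stated assumptions, not the AMPS kernel implementation: the `Core` class, the `speed` field, the `scaled_load` metric (run-queue length divided by relative speed), and the tie-break toward the faster core are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Toy model of a core; all names here are hypothetical."""
    name: str
    speed: float  # assumed relative computing power, 1.0 = fastest core
    runqueue: list = field(default_factory=list)

    def scaled_load(self) -> float:
        # Assumed metric: run-queue length normalized by relative speed.
        return len(self.runqueue) / self.speed


def place(cores, thread):
    """Put `thread` on the core whose scaled load would be lowest
    after the placement; ties go to the faster core."""
    best = min(
        cores,
        key=lambda c: ((len(c.runqueue) + 1) / c.speed, -c.speed),
    )
    best.runqueue.append(thread)
    return best


def demo():
    # One full-speed core and one core emulated at half speed.
    cores = [Core("fast", 1.0), Core("slow", 0.5)]
    for t in ["t0", "t1", "t2"]:
        place(cores, t)
    return {c.name: list(c.runqueue) for c in cores}


if __name__ == "__main__":
    print(demo())
```

With a 2:1 speed ratio, the three threads end up split 2:1 across the fast and slow cores, so both cores carry the same scaled load. A symmetric balancer that equalizes raw run-queue lengths would instead overload the slow core.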
- Efficient operating system scheduling for performance-asymmetric multi-core architectures