DOI: 10.1145/2541940.2541954
research-article

The benefit of SMT in the multi-core era: flexibility towards degrees of thread-level parallelism

Published: 24 February 2014

Abstract

The number of active threads in a multi-core processor varies over time and is often much smaller than the number of supported hardware threads. This requires multi-core chip designs to balance core count and per-core performance. Low active thread counts benefit from a few big, high-performance cores, while high active thread counts benefit more from a sea of small, energy-efficient cores.
This paper comprehensively studies the trade-offs in multi-core design given dynamically varying active thread counts. We find that, under these workload conditions, a homogeneous multi-core processor, consisting of a few high-performance SMT cores, typically outperforms heterogeneous multi-cores consisting of a mix of big and small cores (without SMT), within the same power budget. We also show that a homogeneous multi-core performs almost as well as a heterogeneous multi-core that also implements SMT, as well as a dynamic multi-core, while being less complex to design and verify. Further, heterogeneous multi-cores that power-gate idle cores yield (only) slightly better energy-efficiency compared to homogeneous multi-cores.
The overall conclusion is that the benefit of SMT in the multi-core era is to provide flexibility with respect to the available thread-level parallelism. Consequently, homogeneous multi-cores with big SMT cores are competitive high-performance, energy-efficient design points for workloads with dynamically varying active thread counts.
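
The trade-off stated in the first paragraph of the abstract (few big cores at low active thread counts, many small cores at high active thread counts) can be illustrated with a toy model. This is only a sketch and not the paper's methodology: it assumes Pollack's rule (single-thread performance scales with the square root of core area), a hypothetical area budget of 16 small-core equivalents, a hypothetical big core costing 4 of them, one thread per core, and no SMT. The names `AREA_BUDGET`, `BIG_CORE_AREA`, `core_perf`, and `chip_throughput` are made up for this illustration.

```python
import math

# Hypothetical numbers chosen for illustration only (not from the paper).
AREA_BUDGET = 16    # total chip area, in small-core equivalents
BIG_CORE_AREA = 4   # area of one big core, in small-core equivalents
SMALL_CORE_AREA = 1


def core_perf(area: float) -> float:
    """Single-thread performance under the assumed Pollack's rule: sqrt(area)."""
    return math.sqrt(area)


def chip_throughput(active_threads: int, core_area: float) -> float:
    """Aggregate throughput of a homogeneous chip with one thread per core.

    SMT is deliberately not modeled: threads beyond the core count add nothing.
    """
    n_cores = int(AREA_BUDGET // core_area)
    busy_cores = min(active_threads, n_cores)
    return busy_cores * core_perf(core_area)


if __name__ == "__main__":
    print("threads  few-big  many-small")
    for threads in (1, 2, 4, 8, 16):
        big = chip_throughput(threads, BIG_CORE_AREA)
        small = chip_throughput(threads, SMALL_CORE_AREA)
        print(f"{threads:7d}  {big:7.1f}  {small:10.1f}")
```

Under these assumptions the big-core chip wins while at most four threads are active, and the small-core chip wins once more threads are active than the big-core chip has cores. The paper's argument is that SMT lets the big-core design recover part of that gap, which this sketch does not attempt to quantify.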

    Published In

    ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
    February 2014
    780 pages
    ISBN:9781450323055
    DOI:10.1145/2541940
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. chip multi-core processor
    2. simultaneous multi-threading
    3. single-ISA heterogeneous multi-core
    4. thread-level parallelism

    Qualifiers

    • Research-article

    Conference

    ASPLOS '14

    Acceptance Rates

    ASPLOS '14 Paper Acceptance Rate 49 of 217 submissions, 23%;
    Overall Acceptance Rate 535 of 2,713 submissions, 20%
