Exploring the performance limits of simultaneous multithreading for memory intensive applications

Athanasaki, Evangelia; Anastopoulos, Nikos; Kourtis, Kornilios; Koziris, Nectarios

doi:10.1007/s11227-007-0149-x

Published: 06 October 2007

Volume 44, pages 64–97, (2008)

The Journal of Supercomputing

Abstract

Simultaneous multithreading (SMT) has been proposed to improve system throughput by overlapping instructions from multiple threads on a single wide-issue processor. Recent studies have demonstrated that diversity among simultaneously executed applications can yield significant performance gains under SMT. However, the speedup of a single application that is parallelized into multiple threads is often sensitive to its inherent instruction-level parallelism (ILP), as well as to the efficiency of the synchronization and communication mechanisms between its separate, but possibly dependent, threads. Moreover, because these separate threads tend to put pressure on the same architectural resources, no significant speedup can be observed.

In this paper, we evaluate and contrast thread-level parallelism (TLP) and speculative precomputation (SPR) techniques for a series of memory-intensive codes executed on a specific SMT processor implementation. We explore the performance limits by evaluating the tradeoffs between ILP and TLP for various kinds of instruction streams. By learning how such streams interact when executed simultaneously on the processor, and by quantifying their presence within each application's threads, we try to interpret the observed performance of each application when parallelized according to these techniques. To support this evaluation, we also present results gathered from the processor's performance monitoring hardware.

Author information

Authors and Affiliations

School of Electrical and Computer Engineering, Computing Systems Laboratory, National Technical University of Athens, Zografou Campus, Zografou, 15773, Greece
Evangelia Athanasaki, Nikos Anastopoulos, Kornilios Kourtis & Nectarios Koziris

Corresponding author

Correspondence to Evangelia Athanasaki.

About this article

Cite this article

Athanasaki, E., Anastopoulos, N., Kourtis, K. et al. Exploring the performance limits of simultaneous multithreading for memory intensive applications. J Supercomput 44, 64–97 (2008). https://doi.org/10.1007/s11227-007-0149-x

Received: 19 January 2006
Accepted: 13 August 2007
Published: 06 October 2007
Issue Date: April 2008
DOI: https://doi.org/10.1007/s11227-007-0149-x
