A scalability prediction approach for multi-threaded applications on manycore processors

Bai, Xiuxiu; Wang, Endong; Dong, Xiaoshe; Zhang, Xingjun

doi:10.1007/s11227-015-1505-x

Published: 22 August 2015

Volume 71, pages 4072–4094, (2015)

The Journal of Supercomputing

Xiuxiu Bai¹, Endong Wang², Xiaoshe Dong¹ & Xingjun Zhang¹

Abstract

In the manycore era, developing multi-threaded applications that efficiently leverage the increasing number of cores has become an emerging problem. Applications differ in scalability because they compete for shared resources such as CPU cores, the memory subsystem, or both, depending on the input set. Therefore, to obtain optimal application performance, it is crucial to dynamically predict the scalability of each application and allocate an appropriate number of threads based on that scalability. In this paper, we propose bytes per instruction (BPI), a simple and effective model that provides insight into the scalability of multi-threaded applications, based on an analysis of the interactions among memory-level parallelism, instruction-level parallelism, and thread-level parallelism. Based on the BPI model, we propose (1) a classification approach and (2) a scalability prediction algorithm for multi-threaded applications. Using the scalability prediction algorithm, we implement a scalability-aware thread scheduling approach that allocates the appropriate number of threads to optimize application performance. Evaluation results on a 61-core Intel Xeon Phi coprocessor show that our algorithm predicts the scalability of 120-, 180-, and 240-threaded applications with an average error of 6.8%. Moreover, the accuracy of our prediction algorithm exceeds that of state-of-the-art instruction-level prediction and memory-level prediction by an average of 9.1% and 14.8%, respectively. The scalability-aware thread scheduling approach outperforms full utilization by 12.7%.
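
The abstract does not define the BPI metric or the scheduling rule, so the following is only an illustrative sketch under stated assumptions, not the authors' method. It assumes BPI can be read as memory traffic in bytes divided by instructions retired, that a scheduler picks the smallest thread count with the highest predicted speedup, and that prediction error is a mean relative error. All function names and sample numbers are hypothetical.

```python
from typing import Mapping, Sequence


def bytes_per_instruction(memory_bytes: float, instructions: float) -> float:
    """Assumed reading of BPI: bytes of memory traffic per instruction retired."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return memory_bytes / instructions


def pick_thread_count(predicted_speedup: Mapping[int, float]) -> int:
    """Return the smallest thread count that reaches the best predicted speedup.

    This is a hypothetical selection rule, not the paper's algorithm.
    """
    if not predicted_speedup:
        raise ValueError("no candidate thread counts")
    best = max(predicted_speedup.values())
    return min(t for t, s in predicted_speedup.items() if s == best)


def mean_relative_error(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Average of |predicted - measured| / measured (assumed error definition)."""
    if not measured or len(predicted) != len(measured):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - m) / m for p, m in zip(predicted, measured)) / len(measured)


if __name__ == "__main__":
    # Made-up speedups for the three thread counts named in the abstract.
    candidates = {120: 50.0, 180: 62.0, 240: 58.0}
    print(pick_thread_count(candidates))
```

With these made-up inputs the rule prefers 180 threads over full utilization at 240, which is the kind of decision a scalability-aware scheduler is meant to make when extra threads only add contention.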

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 61173039 and No. 61202041) and the National High Technology Research and Development Program (863 Program) of China (No. 2012AA010904 and No. 2012AA01A306).

Author information

Authors and Affiliations

School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, China
Xiuxiu Bai, Xiaoshe Dong & Xingjun Zhang
The State Key Laboratory of High-end Server and Storage Technology, Jinan, China
Endong Wang

Corresponding author

Correspondence to Xiaoshe Dong.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

About this article

Cite this article

Bai, X., Wang, E., Dong, X. et al. A scalability prediction approach for multi-threaded applications on manycore processors. J Supercomput 71, 4072–4094 (2015). https://doi.org/10.1007/s11227-015-1505-x

Published: 22 August 2015
Issue Date: November 2015
DOI: https://doi.org/10.1007/s11227-015-1505-x
