Dynamic concurrency throttling on NUMA systems and data migration impacts

Design Automation for Embedded Systems

Abstract

Many parallel applications do not scale as the number of threads increases, so using the maximum number of threads does not always deliver the best performance or energy consumption. Many works have therefore proposed strategies for tuning the number of threads to optimize for performance or energy. Since a parallel application may have more than one parallel region, these strategies either determine a specific number of threads for each parallel region or fix a single number of threads for the whole execution. In the former case, strategies apply Dynamic Concurrency Throttling (DCT), which adapts the number of threads at runtime. However, DCT implies overheads, such as creating/destroying threads and cache warm-up. These overheads can be further aggravated on Non-Uniform Memory Access (NUMA) systems, where changing the number of threads may incur remote memory accesses or, more importantly, data migration between nodes. Tuning strategies should therefore not only determine the best number of threads locally, for each parallel region, but also be aware of the impacts of applying DCT. This work investigates how parallel regions may influence each other when DCT is employed, showing that data migration may represent a considerable overhead. Effectively, these overheads affect the strategy's solution, impacting overall application performance and energy consumption. We demonstrate why many approaches will very likely fail when applied to simulated environments, or will hardly reach a near-optimum solution when executed on real hardware.


Notes

  1. The Linux policy for thread scheduling focuses on balancing work between the available resources.

  2. We classified the applications based on the rate of L3 cache misses per instruction, using the Intel PCM tool to collect the miss and instruction counters.

  3. The First-Touch data mapping policy places a page on the node running the thread that causes its first page fault [13].

  4. Although UA is the most significant case in group 4.1, a more detailed analysis of this application proved impossible for two reasons: first, its high number of parallel regions makes the evaluation process extremely costly; and second, most regions are short, and the energy hardware counters are not precise for periods shorter than 0.001 s [17].

References

  1. Alessi F, Thoman P, Georgakoudis G, Fahringer T, Nikolopoulos DS (2015) Application-level energy awareness for OpenMP. Springer, Cham, pp 219–232

  2. Bailey DH, Barszcz E, Barton JT, Browning DS, Carter RL, Dagum L, Fatoohi RA, Frederickson PO, Lasinski TA, Schreiber RS, Simon HD, Venkatakrishnan V, Weeratunga SK (1991) The NAS parallel benchmarks—summary and preliminary results. In: ACM/IEEE CS. ACM, NY, USA, pp 158–165. https://doi.org/10.1145/125826.125925

  3. Bari MAS, Chaimov N, Malik AM, Huck KA, Chapman B, Malony AD, Sarood O (2016) ARCS: adaptive runtime configuration selection for power-constrained OpenMP applications. In: 2016 IEEE international conference on cluster computing (CLUSTER), pp 461–470

  4. Beck ACS, Lisbôa CAL, Carro L (2012) Adaptable embedded systems. Springer, Berlin

  5. Broquedis F, Aumage O, Goglin B, Thibault S, Wacrenier PA, Namyst R (2010) Structuring the execution of OpenMP applications for multicore architectures. In: 2010 IEEE international symposium on parallel & distributed processing (IPDPS). IEEE, pp 1–10

  6. Broquedis F, Furmento N, Goglin B, Wacrenier PA, Namyst R (2010) ForestGOMP: an efficient OpenMP environment for NUMA architectures. Int J Parallel Prog 38(5–6):418–439

  7. Chadha G, Mahlke S, Narayanasamy S (2012) When less is more (LIMO): controlled parallelism for improved efficiency. In: CASES. USA, pp 141–150

  8. Corbet J. Toward better NUMA scheduling. https://lwn.net/Articles/486858/

  9. Curtis-Maury M, Blagojevic F, Antonopoulos CD, Nikolopoulos DS (2008) Prediction-based power-performance adaptation of multithreaded scientific codes. IEEE Trans Parallel Distrib Syst 19(10):1396–1410

  10. Curtis-Maury M, Dzierwa J, Antonopoulos CD, Nikolopoulos DS (2006) Online power-performance adaptation of multithreaded programs using hardware event-based prediction. In: ICS, pp 157–166

  11. Dashti M, Fedorova A, Funston J, Gaud F, Lachaize R, Lepers B, Quema V, Roth M (2013) Traffic management: a holistic approach to memory placement on NUMA systems. ACM SIGARCH Comput Archit News 41(1):381–394

  12. De Sensi D (2016) Predicting performance and power consumption of parallel applications. In: PDP, pp 200–207. https://doi.org/10.1109/PDP.2016.41

  13. Diener M, Cruz EH, Alves MA, Navaux PO, Koren I (2016) Affinity-based thread and data mapping in shared memory systems. ACM Comput Surv (CSUR) 49(4):1–38

  14. Diener M, Cruz EH, Navaux PO (2015) Locality vs. balance: exploring data mapping policies on NUMA systems. In: 2015 23rd Euromicro international conference on parallel, distributed, and network-based processing. IEEE, pp 9–16

  15. Diener M, Cruz EH, Navaux PO, Busse A, Heiß HU (2014) kMAF: automatic kernel-level management of thread and data affinity. In: Proceedings of the 23rd international conference on parallel architectures and compilation. ACM, pp 277–288

  16. Diener M, Cruz EH, Pilla LL, Dupros F, Navaux PO (2015) Characterizing communication and page usage of parallel applications for thread and data mapping. Perform Eval 88:18–36

  17. Hähnel M, Döbel B, Völp M, Härtig H (2012) Measuring energy consumption for short code paths using RAPL. SIGMETRICS Perform Eval Rev 40(3):13–17. https://doi.org/10.1145/2425248.2425252

  18. Joao JA, Suleman MA, Mutlu O, Patt YN (2012) Bottleneck identification and scheduling in multithreaded applications. In: ASPLOS. ACM, NY, USA, pp 223–234. https://doi.org/10.1145/2150976.2151001

  19. Jung C, Lim D, Lee J, Han S (2005) Adaptive execution techniques for SMT multiprocessor architectures. In: ACM symposium on principles and practice of parallel programming. USA, pp 236–246

  20. Lee J, Wu H, Ravichandran M, Clark N (2010) Thread tailor: dynamically weaving threads together for efficient, adaptive parallel applications. SIGARCH Comput Archit News 38(3):270–279

  21. Lepers B, Quéma V, Fedorova A (2015) Thread and memory placement on NUMA systems: asymmetry matters. In: 2015 USENIX annual technical conference (USENIX ATC 15), pp 277–289

  22. Lorenzon AF, Beck ACS (2019) Parallel computing hits the power wall: principles, challenges, and a survey of solutions. Springer, Berlin

  23. Lorenzon AF, Cera MC, Beck ACS (2016) Investigating different general-purpose and embedded multicores to achieve optimal trade-offs between performance and energy. J Parallel Distrib Comput 95:107–123

  24. Lorenzon AF, Oliveira CCD, Souza JD, Filho ACSB (2018) Aurora: seamless optimization of OpenMP applications. IEEE Trans Parallel Distrib Syst, pp 1–1. https://doi.org/10.1109/TPDS.2018.2872992

  25. Lorenzon AF, Sartor AL, Cera MC, Beck ACS (2015) Optimized use of parallel programming interfaces in multithreaded embedded architectures. In: 2015 IEEE computer society annual symposium on VLSI. IEEE, pp 410–415

  26. Lorenzon AF, Souza JD, Beck ACS (2017) LAANT: a library to automatically optimize EDP for OpenMP applications. In: DATE, pp 1229–1232. https://doi.org/10.23919/DATE.2017.7927176

  27. McCalpin JD (1995) Memory bandwidth and machine balance in current high performance computers. In: IEEE computer society technical committee on computer architecture newsletter, pp 19–25

  28. Mucci PJ, Browne S, Deane C, Ho G (1999) PAPI: a portable interface to hardware performance counters. In: Proceedings of the department of defense HPCMP users group conference, vol 710

  29. Petersen W, Arbenz P (2004) Introduction to parallel computing: a practical guide with examples in C. Oxford texts in applied and engineering mathematics. OUP, Oxford

  30. Porterfield AK, Olivier SL, Bhalachandra S, Prins JF (2013) Power measurement and concurrency throttling for energy reduction in OpenMP programs. In: IEEE IPDPS, pp 884–891

  31. Pusukuri KK, Gupta R, Bhuyan LN (2011) Thread reinforcer: dynamically determining number of threads via OS level monitoring. In: IEEE ISWC. USA, pp 116–125

  32. Quinn M (2004) Parallel programming in C with MPI and OpenMP. McGraw-Hill Higher Education, New York City

  33. Raasch SE, Reinhardt SK (2003) The impact of resource partitioning on SMT processors. In: PACT, pp 15–25. https://doi.org/10.1109/PACT.2003.1237998

  34. Schwarzrock J, Lorenzon AF, Navaux PO, Beck ACS, de Freitas EP (2017) Potential gains in EDP by dynamically adapting the number of threads for OpenMP applications in embedded systems. In: 2017 VII Brazilian symposium on computing systems engineering (SBESC). IEEE, pp 79–85

  35. Sensi DD, Torquati M, Danelutto M (2016) A reconfiguration algorithm for power-aware parallel applications. TACO 13(4):43-1–43-25. https://doi.org/10.1145/3004054

  36. Seo S, Jo G, Lee J (2011) Performance characterization of the NAS parallel benchmarks in OpenCL. In: IEEE ISWC, pp 137–148. https://doi.org/10.1109/IISWC.2011.6114174

  37. Sridharan S, Gupta G, Sohi GS (2014) Adaptive, efficient, parallel execution of parallel programs. In: ACM SIGPLAN PLDI. ACM, NY, USA, pp 169–180

  38. Subramanian L, Seshadri V, Kim Y, Jaiyen B, Mutlu O (2013) MISE: providing performance predictability and improving fairness in shared main memory systems. In: IEEE HPCA, pp 639–650

  39. Suleman MA, Qureshi MK, Patt YN (2008) Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs. SIGARCH Comput Archit News 36(1):277–286

  40. Wang W, Davidson JW, Soffa ML (2016) Predicting the memory bandwidth and optimal core allocations for multi-threaded applications on large-scale NUMA machines. In: 2016 IEEE international symposium on high performance computer architecture (HPCA). IEEE, pp 419–431

Acknowledgements

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001, the Fundação de Amparo à Pesquisa do Estado do RS (FAPERGS) and the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq).

Author information

Corresponding author

Correspondence to Janaina Schwarzrock.

About this article

Cite this article

Schwarzrock, J., Jordan, M.G., Korol, G. et al. Dynamic concurrency throttling on NUMA systems and data migration impacts. Des Autom Embed Syst 25, 135–160 (2021). https://doi.org/10.1007/s10617-020-09243-5
